LA Crime¶

Author: Dhanush Vasa

Table of Contents:¶

  1. Introduction
  2. Data Collection
  3. Data Cleaning and Exploratory Analysis
  4. Modeling
  5. Interpretation of Results
  6. Conclusion

1. Introduction¶

The aim of this tutorial is to guide you through the data science lifecycle, introducing various key concepts in data science along the way. The stages of the lifecycle are:

  1. Data Collection
  2. Data Cleaning
  3. Exploratory Analysis and Visualization
  4. Modeling
  5. Results Interpretation

Crime in Los Angeles is a complicated and dynamic issue, influenced by the city's size, diversity, and socioeconomic circumstances. As one of the major metropolitan areas in the United States, Los Angeles sees a wide range of criminal activity and violence. Understanding crime patterns and trends in Los Angeles is crucial for keeping the public safe and ensuring that law enforcement agencies allocate resources efficiently.

Many factors influence criminal activity in Los Angeles, including local demographics, economic conditions, and the physical environment. Some parts of the city are recurring hotspots for various sorts of crime, often associated with population density, accessibility, or the presence of specific establishments. For example, theft may be more prevalent in commercial districts, whereas violent crimes may cluster in economically deprived areas. Temporal trends are particularly important, as certain crimes tend to increase in specific seasons of the year, on certain days of the week, or even at particular hours of the day.

By using statistics to examine crime in Los Angeles, policymakers and law enforcement can uncover critical patterns and build focused prevention and intervention initiatives. For example, assessing geographic trends can help assign police patrols more effectively to high-crime areas, while investigating temporal trends can inform resource deployment during peak hours. Furthermore, understanding the underlying causes of crime, whether social, economic, or environmental, can help drive community-based programs to reduce criminal behavior. A comprehensive approach to studying crime in Los Angeles is critical for developing safer communities and building confidence between residents and law enforcement.

Important Note:

  • In certain sections of the code, you may encounter warning messages. These can be safely ignored while focusing on the intended output of the code.

2. Data Collection¶

To begin any analysis, we must collect data relevant to the question we want to answer. The quality of a machine learning model is directly proportional to the quality of the data it processes. A solid dataset ensures that the model finds relevant patterns and draws sound conclusions. As a result, selecting the appropriate dataset is an important stage in the data science lifecycle.

In this tutorial, we will use the Crime Data from 2020 to Present (Los Angeles) dataset from OpenML, which provides a complete account of reported criminal incidents in the city beginning in 2020. This dataset contains critical attributes such as unique report numbers, dates and times of reporting and occurrence, crime descriptions with related codes, and specific geographic information such as area names, premises descriptions, and exact latitude/longitude coordinates. It also provides demographic information about victims, weapons used, and the status of each crime report.

For public safety agencies, analysts, and researchers, this dataset is significant because it makes it easier to identify patterns in crime, analyze hotspots, and assess the efficacy of law enforcement. By utilizing this data, we can investigate a range of use cases, including building predictive models, understanding the social factors that influence crime, and empowering decision-makers. Spatial data, for instance, might be used to identify high-crime areas and guide resource allocation, while knowledge of victim demographic trends could help guide community safety efforts. Because of its extensive scope, this dataset is well suited for methodically researching and tackling crime.

Importing Python Libraries¶

As shown below, we must import the necessary Python libraries before beginning; they will be used throughout the tutorial. Because the code was written and tested in Jupyter Notebook, a popular tool among data scientists that makes data visualization and analysis easier, we recommend running it there. As we move further, we will explain the role of each library where it is first used.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
import geopandas as gpd
import folium
from shapely.geometry import Point
from folium.plugins import HeatMap
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (classification_report, accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score, roc_auc_score)
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Filter out the warnings
import warnings
warnings.filterwarnings("ignore")
Download and Import The Data¶

Head over to the LA Crime dataset page and download the dataset file. Extract the data and keep the file in the same folder as the notebook; the code below assumes this layout, so the steps can be replicated if required.

In [3]:
# Read the uploaded file to determine its format
file_path = 'dataset_'
# Read the first few bytes of the file to inspect its structure
with open(file_path, 'rb') as file:
    file_head = file.read(512)  # Read the first 512 bytes

file_head.decode(errors='replace')
Out[3]:
'% Description:\n% This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, p'

This cell begins the exploration by reading the first 512 bytes of the dataset file in binary mode. The decoded header reveals metadata about the recorded incidents: report numbers, dates, times, crime descriptions, locations, and victim/suspect details.

In [4]:
# Display the first few lines of the file to identify delimiters or formatting issues
with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
    for i in range(10):  # Display the first 10 lines
        print(file.readline())
% Description:

% This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, public safety organizations, and researchers to understand crime patterns, allocate resources effectively, and develop crime prevention strategies.

%

% Attribute Description:

% - DR_NO: A unique identifier for the crime report.

% - Date Rptd & DATE OCC: The dates when the crime was reported and occurred.

% - TIME OCC: The time when the crime occurred.

% - AREA & AREA NAME: Numeric and textual descriptions of the area where the crime occurred.

% - Rpt Dist No: The reporting district number.

% - Part 1-2: Indicates whether the crime is a Part 1 (more severe) or Part 2 offense.

Reads and prints the first 10 lines of Crime_Data_from_2020_to_Present.csv in UTF-8 encoding to inspect its structure, delimiters, and metadata, revealing attributes like DR_NO, report dates, times, area details, district numbers, and offense classification (Part 1 or Part 2) for further analysis.

In [5]:
# Attempt to locate the line where the actual dataset starts
with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
    lines = file.readlines()

# Display lines to find the starting point of the dataset
for i, line in enumerate(lines[:50]):  # Check the first 50 lines
    print(f"Line {i + 1}: {line.strip()}")
Line 1: % Description:
Line 2: % This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, public safety organizations, and researchers to understand crime patterns, allocate resources effectively, and develop crime prevention strategies.
Line 3: %
Line 4: % Attribute Description:
Line 5: % - DR_NO: A unique identifier for the crime report.
Line 6: % - Date Rptd & DATE OCC: The dates when the crime was reported and occurred.
Line 7: % - TIME OCC: The time when the crime occurred.
Line 8: % - AREA & AREA NAME: Numeric and textual descriptions of the area where the crime occurred.
Line 9: % - Rpt Dist No: The reporting district number.
Line 10: % - Part 1-2: Indicates whether the crime is a Part 1 (more severe) or Part 2 offense.
Line 11: % - Crm Cd & Crm Cd Desc: The crime code and its description.
Line 12: % - Mocodes: Modus operandi codes related to the crime.
Line 13: % - Vict Age, Vict Sex, Vict Descent: Age, sex, and ethnic descent of the victim.
Line 14: % - Premis Cd & Premis Desc: Codes and descriptions of the premises where the crime occurred.
Line 15: % - Weapon Used Cd & Weapon Desc: Codes and descriptions of any weapons used.
Line 16: % - Status & Status Desc: The status of the crime report and its description (e.g., Invest Cont, Adult Arrest).
Line 17: % - Crm Cd 1-4: Additional crime codes related to the incident.
Line 18: % - LOCATION & Cross Street: The specific location and, if applicable, cross street of the crime.
Line 19: % - LAT & LON: Latitude and longitude of the crime location.
Line 20: %
Line 21: % Use Case:
Line 22: % This dataset is crucial for public safety analyses, allowing for the tracking of crime trends, hotspot identification, and the assessment of law enforcement effectiveness. It can also be utilized by policymakers for strategic planning and by academic researchers studying the sociology of crime or developing predictive models. Community groups may use this data to advocate for safety and support initiatives in their neighborhoods.
Line 23: @RELATION Crime_Data_from_2020_to_present_in_Los_Angeles
Line 24: 
Line 25: @ATTRIBUTE DR_NO INTEGER
Line 26: @ATTRIBUTE "Date Rptd" STRING
Line 27: @ATTRIBUTE "DATE OCC" STRING
Line 28: @ATTRIBUTE "TIME OCC" INTEGER
Line 29: @ATTRIBUTE AREA INTEGER
Line 30: @ATTRIBUTE "AREA NAME" STRING
Line 31: @ATTRIBUTE "Rpt Dist No" INTEGER
Line 32: @ATTRIBUTE "Part 1-2" INTEGER
Line 33: @ATTRIBUTE "Crm Cd" INTEGER
Line 34: @ATTRIBUTE "Crm Cd Desc" STRING
Line 35: @ATTRIBUTE Mocodes STRING
Line 36: @ATTRIBUTE "Vict Age" INTEGER
Line 37: @ATTRIBUTE "Vict Sex" STRING
Line 38: @ATTRIBUTE "Vict Descent" STRING
Line 39: @ATTRIBUTE "Premis Cd" REAL
Line 40: @ATTRIBUTE "Premis Desc" STRING
Line 41: @ATTRIBUTE "Weapon Used Cd" REAL
Line 42: @ATTRIBUTE "Weapon Desc" STRING
Line 43: @ATTRIBUTE Status STRING
Line 44: @ATTRIBUTE "Status Desc" STRING
Line 45: @ATTRIBUTE "Crm Cd 1" REAL
Line 46: @ATTRIBUTE "Crm Cd 2" REAL
Line 47: @ATTRIBUTE "Crm Cd 3" REAL
Line 48: @ATTRIBUTE "Crm Cd 4" REAL
Line 49: @ATTRIBUTE LOCATION STRING
Line 50: @ATTRIBUTE "Cross Street" STRING

This process identifies where the actual dataset starts in a file containing descriptive metadata by reading the first 50 lines, revealing attribute descriptions in the @ATTRIBUTE format (typical of ARFF files), ensuring accurate data parsing for further analysis.

In [6]:
# Locate the starting point of the actual data
data_start = None
for i, line in enumerate(lines):
    if "@DATA" in line.upper():  # ARFF files typically use '@DATA' to mark the start of data
        data_start = i + 1  # Data starts after this line
        break

# Display a few lines of the actual data if found
if data_start:
    print(f"Data starts at line {data_start + 1}.")
    for line in lines[data_start:data_start + 10]:
        print(line.strip())
else:
    print("No '@DATA' section found; the structure might differ.")
Data starts at line 55.
190326475,'03/01/2020 12:00:00 AM','03/01/2020 12:00:00 AM',2130,7,Wilshire,784,1,510,'VEHICLE - STOLEN',?,0,M,O,101.0,STREET,?,?,AA,'Adult Arrest',510.0,998.0,?,?,'1900 S  LONGWOOD                     AV',?,34.0375,-118.3506
200106753,'02/09/2020 12:00:00 AM','02/08/2020 12:00:00 AM',1800,1,Central,182,1,330,'BURGLARY FROM VEHICLE','1822 1402 0344',47,M,O,128.0,'BUS STOP/LAYOVER (ALSO QUERY 124)',?,?,IC,'Invest Cont',330.0,998.0,?,?,'1000 S  FLOWER                       ST',?,34.0444,-118.2628
200320258,'11/11/2020 12:00:00 AM','11/04/2020 12:00:00 AM',1700,3,Southwest,356,1,480,'BIKE - STOLEN','0344 1251',19,X,X,502.0,'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',?,?,IC,'Invest Cont',480.0,?,?,?,'1400 W  37TH                         ST',?,34.021,-118.3002
200907217,'05/10/2023 12:00:00 AM','03/10/2020 12:00:00 AM',2037,9,'Van Nuys',964,1,343,'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)','0325 1501',19,M,O,405.0,'CLOTHING STORE',?,?,IC,'Invest Cont',343.0,?,?,?,'14000    RIVERSIDE                    DR',?,34.1576,-118.4387
220614831,'08/18/2022 12:00:00 AM','08/17/2020 12:00:00 AM',1200,6,Hollywood,666,2,354,'THEFT OF IDENTITY','1822 1501 0930 2004',28,M,H,102.0,SIDEWALK,?,?,IC,'Invest Cont',354.0,?,?,?,'1900    TRANSIENT',?,34.0944,-118.3277
231808869,'04/04/2023 12:00:00 AM','12/01/2020 12:00:00 AM',2300,18,Southeast,1826,2,354,'THEFT OF IDENTITY','1822 0100 0930 0929',41,M,H,501.0,'SINGLE FAMILY DWELLING',?,?,IC,'Invest Cont',354.0,?,?,?,'9900    COMPTON                      AV',?,33.9467,-118.2463
230110144,'04/04/2023 12:00:00 AM','07/03/2020 12:00:00 AM',900,1,Central,182,2,354,'THEFT OF IDENTITY','0930 0929',25,M,H,502.0,'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',?,?,IC,'Invest Cont',354.0,?,?,?,'1100 S  GRAND                        AV',?,34.0415,-118.262
220314085,'07/22/2022 12:00:00 AM','05/12/2020 12:00:00 AM',1110,3,Southwest,303,2,354,'THEFT OF IDENTITY',0100,27,F,B,248.0,'CELL PHONE STORE',?,?,IC,'Invest Cont',354.0,?,?,?,'2500 S  SYCAMORE                     AV',?,34.0335,-118.3537
231309864,'04/28/2023 12:00:00 AM','12/09/2020 12:00:00 AM',1400,13,Newton,1375,2,354,'THEFT OF IDENTITY',0100,24,F,B,750.0,CYBERSPACE,?,?,IC,'Invest Cont',354.0,?,?,?,'1300 E  57TH                         ST',?,33.9911,-118.2521
211904005,'12/31/2020 12:00:00 AM','12/31/2020 12:00:00 AM',1220,19,Mission,1974,2,624,'BATTERY - SIMPLE ASSAULT',0416,26,M,H,502.0,'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',400.0,'STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)',IC,'Invest Cont',624.0,?,?,?,'9000    CEDROS                       AV',?,34.2336,-118.4535

This process locates the starting point of the dataset in an ARFF file by identifying the @DATA marker, confirming that data begins at line 55, and displaying initial rows of comma-separated crime records, ensuring accurate parsing for further analysis.

In [7]:
# Re-import necessary libraries
import pandas as pd

# Re-attempt to process the dataset
try:
    # Reload and inspect the first few lines of the file
    with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
        for i in range(10):  # Display the first 10 lines
            print(file.readline().strip())

    # Load the dataset by skipping metadata and identifying the start of the actual data
    df = pd.read_csv(file_path, skiprows=55, delimiter=',', on_bad_lines='skip', header = None)
    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error occurred while processing the dataset: {e}")
% Description:
% This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, public safety organizations, and researchers to understand crime patterns, allocate resources effectively, and develop crime prevention strategies.
%
% Attribute Description:
% - DR_NO: A unique identifier for the crime report.
% - Date Rptd & DATE OCC: The dates when the crime was reported and occurred.
% - TIME OCC: The time when the crime occurred.
% - AREA & AREA NAME: Numeric and textual descriptions of the area where the crime occurred.
% - Rpt Dist No: The reporting district number.
% - Part 1-2: Indicates whether the crime is a Part 1 (more severe) or Part 2 offense.
Dataset loaded successfully!

This process reloads the dataset with pandas, skipping the metadata header via skiprows=55, tolerating malformed rows with on_bad_lines='skip', and reading without a header row (header=None) so the metadata is not mistaken for column names. Note that because the data itself begins on line 55, skiprows=55 also drops the first data record (compare the earlier preview, which starts at 190326475, with df.head() below); skiprows=54 would keep it.
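The skip-the-metadata technique can be exercised on a miniature in-memory file. This is a sketch with made-up contents (the toy relation, attribute, and values are illustrative, not taken from the real dataset):

```python
import io
import pandas as pd

# A miniature ARFF-style file: metadata lines followed by raw CSV rows.
raw = """% Description: toy example
@RELATION toy
@ATTRIBUTE id INTEGER
@DATA
1,Central,330
2,Hollywood,354
"""

lines = raw.splitlines()
# Find the first line after the '@DATA' marker, as done for the real file.
data_start = next(i + 1 for i, line in enumerate(lines)
                  if line.upper().startswith("@DATA"))

# Skip exactly data_start lines so the first data row is retained.
df_toy = pd.read_csv(io.StringIO(raw), skiprows=data_start, header=None)
print(df_toy.shape)  # (2, 3)
```

Counting skipped lines from the marker's index (rather than from a printed 1-based line number) avoids the off-by-one that drops the first record.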

In [8]:
df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25 26 27
0 200106753 '02/09/2020 12:00:00 AM' '02/08/2020 12:00:00 AM' 1800 1 Central 182 1 330 'BURGLARY FROM VEHICLE' ... IC 'Invest Cont' 330.0 998.0 ? ? '1000 S FLOWER ST' ? 34.0444 -118.2628
1 200907217 '05/10/2023 12:00:00 AM' '03/10/2020 12:00:00 AM' 2037 9 'Van Nuys' 964 1 343 'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)' ... IC 'Invest Cont' 343.0 ? ? ? '14000 RIVERSIDE DR' ? 34.1576 -118.4387
2 220614831 '08/18/2022 12:00:00 AM' '08/17/2020 12:00:00 AM' 1200 6 Hollywood 666 2 354 'THEFT OF IDENTITY' ... IC 'Invest Cont' 354.0 ? ? ? '1900 TRANSIENT' ? 34.0944 -118.3277
3 231808869 '04/04/2023 12:00:00 AM' '12/01/2020 12:00:00 AM' 2300 18 Southeast 1826 2 354 'THEFT OF IDENTITY' ... IC 'Invest Cont' 354.0 ? ? ? '9900 COMPTON AV' ? 33.9467 -118.2463
4 220314085 '07/22/2022 12:00:00 AM' '05/12/2020 12:00:00 AM' 1110 3 Southwest 303 2 354 'THEFT OF IDENTITY' ... IC 'Invest Cont' 354.0 ? ? ? '2500 S SYCAMORE AV' ? 34.0335 -118.3537

5 rows × 28 columns

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533557 entries, 0 to 533556
Data columns (total 28 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       533557 non-null  int64 
 1   1       533557 non-null  object
 2   2       533557 non-null  object
 3   3       533557 non-null  int64 
 4   4       533557 non-null  int64 
 5   5       533557 non-null  object
 6   6       533557 non-null  int64 
 7   7       533557 non-null  int64 
 8   8       533557 non-null  int64 
 9   9       533557 non-null  object
 10  10      533557 non-null  object
 11  11      533557 non-null  object
 12  12      533557 non-null  object
 13  13      533557 non-null  object
 14  14      533557 non-null  object
 15  15      533557 non-null  object
 16  16      533557 non-null  object
 17  17      533557 non-null  object
 18  18      533557 non-null  object
 19  19      533557 non-null  object
 20  20      533557 non-null  object
 21  21      533557 non-null  object
 22  22      533557 non-null  object
 23  23      533557 non-null  object
 24  24      533557 non-null  object
 25  25      533557 non-null  object
 26  26      533557 non-null  object
 27  27      533557 non-null  object
dtypes: int64(6), object(22)
memory usage: 114.0+ MB

3. Data Cleaning and Exploratory Analysis¶

Data cleaning is the essential process of preparing a dataset for analysis or machine learning by ensuring it is consistent, complete, and accurate. This process involves several key tasks, such as removing unnecessary or irrelevant data, filling in missing values, and standardizing metrics or measurements to create uniformity. Additionally, new features can be derived from existing data to make the dataset more useful and meaningful. By addressing errors and inconsistencies, data cleaning ensures the dataset is reliable and forms a solid foundation for further analysis or model training.

Often combined with data cleaning, exploratory analysis involves examining the dataset to uncover patterns, trends, and relationships that provide valuable insights. This step includes creating visualizations, such as graphs or plots, to identify correlations or significant variables, and spotting potential issues like outliers that may need cleaning. Insights gained during this process may guide the creation of new features or adjustments to existing ones, refining the dataset for better performance in a machine learning model. By integrating these two steps, we not only ensure the dataset is clean but also well-understood, which is critical for building effective models or conducting insightful analysis.

Key Steps in Data Cleaning:¶
  • Remove unnecessary or irrelevant data.
  • Fill in missing values to address gaps.
  • Standardize metrics or measurements.
  • Create new features from existing data to enhance usability.
  • Ensure data accuracy and reliability.
Key Steps in Exploratory Analysis:¶
  • Visualize data through graphs and plots to uncover patterns and relationships.
  • Identify significant features or correlations.
  • Detect issues like outliers or irregularities for further cleaning.
  • Refine research questions based on insights from the data.
  • Create or adjust features to align with identified trends or insights.
  • Combine exploratory analysis with cleaning for a comprehensive understanding of the dataset.
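One of the steps above, creating new features from existing data, can be sketched on a toy frame. The column names here are hypothetical, but mirror this dataset's TIME OCC field, which stores military time as an integer (e.g. 2130 for 9:30 PM):

```python
import pandas as pd

# Toy frame with a military-time column like TIME OCC.
toy = pd.DataFrame({"Time_Occ": [2130, 1800, 900, 30]})

# Derive an hour-of-day feature by integer division.
toy["Hour_Occ"] = toy["Time_Occ"] // 100
print(toy["Hour_Occ"].tolist())  # [21, 18, 9, 0]
```

An hour-of-day feature like this is what makes the temporal analyses described earlier (peak-hour patterns) possible.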
In [10]:
df.columns = [
    "DR_NO", "Date_Rptd", "Date_Occ", "Time_Occ", "Area", "Area_Name",
    "Rpt_Dist_No", "Part_1_2", "Crm_Cd", "Crm_Cd_Desc", "Mocodes", "Vict_Age",
    "Vict_Sex", "Vict_Descent", "Premis_Cd", "Premis_Desc", "Weapon_Used_Cd",
    "Weapon_Desc", "Status", "Status_Desc", "Crm_Cd_1", "Crm_Cd_2", "Crm_Cd_3",
    "Crm_Cd_4", "Location", "Cross_Street", "Lat", "Lon"
]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533557 entries, 0 to 533556
Data columns (total 28 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   DR_NO           533557 non-null  int64 
 1   Date_Rptd       533557 non-null  object
 2   Date_Occ        533557 non-null  object
 3   Time_Occ        533557 non-null  int64 
 4   Area            533557 non-null  int64 
 5   Area_Name       533557 non-null  object
 6   Rpt_Dist_No     533557 non-null  int64 
 7   Part_1_2        533557 non-null  int64 
 8   Crm_Cd          533557 non-null  int64 
 9   Crm_Cd_Desc     533557 non-null  object
 10  Mocodes         533557 non-null  object
 11  Vict_Age        533557 non-null  object
 12  Vict_Sex        533557 non-null  object
 13  Vict_Descent    533557 non-null  object
 14  Premis_Cd       533557 non-null  object
 15  Premis_Desc     533557 non-null  object
 16  Weapon_Used_Cd  533557 non-null  object
 17  Weapon_Desc     533557 non-null  object
 18  Status          533557 non-null  object
 19  Status_Desc     533557 non-null  object
 20  Crm_Cd_1        533557 non-null  object
 21  Crm_Cd_2        533557 non-null  object
 22  Crm_Cd_3        533557 non-null  object
 23  Crm_Cd_4        533557 non-null  object
 24  Location        533557 non-null  object
 25  Cross_Street    533557 non-null  object
 26  Lat             533557 non-null  object
 27  Lon             533557 non-null  object
dtypes: int64(6), object(22)
memory usage: 114.0+ MB

Renames the columns of the DataFrame df to a specified list of column names and displays a summary of the DataFrame structure using df.info().

In [11]:
df['Crm_Cd_Desc'].unique()
Out[11]:
array(["'BURGLARY FROM VEHICLE'",
       "'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)'",
       "'THEFT OF IDENTITY'", "'VEHICLE - STOLEN'",
       "'CRIMINAL THREATS - NO WEAPON DISPLAYED'",
       "'THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)'",
       "'CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER)'",
       'BURGLARY', "'THEFT PLAIN - PETTY ($950 & UNDER)'",
       "'LEWD CONDUCT'", "'THEFT PLAIN - ATTEMPT'",
       "'THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)'",
       "'CHILD ANNOYING (17YRS & UNDER)'", "'OTHER MISCELLANEOUS CRIME'",
       'ROBBERY', "'UNAUTHORIZED COMPUTER ACCESS'",
       "'VIOLATION OF RESTRAINING ORDER'",
       "'SHOPLIFTING - PETTY THEFT ($950 & UNDER)'", "'BRANDISH WEAPON'",
       "'DOCUMENT FORGERY / STOLEN FELONY'",
       "'SEX OFFENDER REGISTRANT OUT OF COMPLIANCE'",
       "'VANDALISM - MISDEAMEANOR ($399 OR UNDER)'",
       "'CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT'", "'BIKE - STOLEN'",
       'EXTORTION', 'PICKPOCKET', 'ARSON', "'DISTURBING THE PEACE'",
       "'PEEPING TOM'", "'ORAL COPULATION'", "'VIOLATION OF COURT ORDER'",
       "'INTIMATE PARTNER - SIMPLE ASSAULT'", "'FALSE POLICE REPORT'",
       "'INTIMATE PARTNER - AGGRAVATED ASSAULT'", 'CONTRIBUTING',
       "'FALSE IMPRISONMENT'", "'ATTEMPTED ROBBERY'", "'CHILD STEALING'",
       "'INDECENT EXPOSURE'", "'CHILD NEGLECT (SEE 300 W.I.C.)'",
       "'DISHONEST EMPLOYEE - GRAND THEFT'", 'TRESPASSING',
       "'BATTERY - SIMPLE ASSAULT'", "'CONTEMPT OF COURT'",
       "'THREATENING PHONE CALLS/LETTERS'", 'PIMPING',
       "'VEHICLE - ATTEMPT STOLEN'", 'PANDERING',
       "'LEWD/LASCIVIOUS ACTS WITH CHILD'",
       "'HUMAN TRAFFICKING - COMMERCIAL SEX ACTS'",
       "'FIREARMS RESTRAINING ORDER (FIREARMS RO)'",
       "'DISCHARGE FIREARMS/SHOTS FIRED'", "'FAILURE TO YIELD'",
       "'BOMB SCARE'", "'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER'",
       "'OTHER ASSAULT'", "'BATTERY POLICE (SIMPLE)'",
       "'THEFT FROM PERSON - ATTEMPT'",
       "'SHOTS FIRED AT INHABITED DWELLING'",
       "'CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT'",
       "'TILL TAP - GRAND THEFT ($950.01 & OVER)'",
       "'VIOLATION OF TEMPORARY RESTRAINING ORDER'", "'RESISTING ARREST'",
       "'THROWING OBJECT AT MOVING VEHICLE'",
       "'DOCUMENT WORTHLESS ($200.01 & OVER)'",
       "'SEXUAL PENETRATION W/FOREIGN OBJECT'", 'KIDNAPPING',
       "'CRIMINAL HOMICIDE'", "'PURSE SNATCHING'",
       "'THEFT FROM MOTOR VEHICLE - ATTEMPT'",
       "'SODOMY/SEXUAL CONTACT B/W PENIS OF ONE PERS TO ANUS OTH'",
       "'DRIVING WITHOUT OWNER CONSENT (DWOC)'", "'RECKLESS DRIVING'",
       'STALKING', "'SHOPLIFTING - ATTEMPT'", "'CHILD PORNOGRAPHY'",
       "'BATTERY WITH SEXUAL CONTACT'", 'COUNTERFEIT',
       "'CRUELTY TO ANIMALS'", "'BOAT - STOLEN'", "'ILLEGAL DUMPING'",
       'PROWLER', "'DOCUMENT WORTHLESS ($200 & UNDER)'",
       "'BATTERY ON A FIREFIGHTER'", "'PETTY THEFT - AUTO REPAIR'",
       "'TILL TAP - PETTY ($950 & UNDER)'",
       "'KIDNAPPING - GRAND ATTEMPT'",
       "'DISHONEST EMPLOYEE - PETTY THEFT'",
       "'HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE'",
       "'WEAPONS POSSESSION/BOMBING'", "'BIKE - ATTEMPTED STOLEN'",
       "'GRAND THEFT / AUTO REPAIR'", 'CONSPIRACY', 'BRIBERY',
       "'PURSE SNATCHING - ATTEMPT'", "'GRAND THEFT / INSURANCE FRAUD'",
       "'DRUNK ROLL'", "'CHILD ABANDONMENT'", "'DISRUPT SCHOOL'",
       "'FAILURE TO DISPERSE'",
       "'FIREARMS EMERGENCY PROTECTIVE ORDER (FIREARMS EPO)'", 'BIGAMY',
       "'VANDALISM - FELONY ($400 & OVER", "'ASSAULT WITH DEADLY WEAPON",
       "'BURGLARY", "'CREDIT CARDS", "'EMBEZZLEMENT", "'BUNCO", "'THEFT",
       "'BURGLARY FROM VEHICLE", "'RAPE",
       "'SHOTS FIRED AT MOVING VEHICLE",
       "'DEFRAUDING INNKEEPER/THEFT OF SERVICES", "'BEASTIALITY",
       "'INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES)'", "'DRUGS",
       "'TELEPHONE PROPERTY - DAMAGE'", "'INCITING A RIOT'",
       "'DISHONEST EMPLOYEE ATTEMPTED THEFT'",
       "'BLOCKING DOOR INDUCTION CENTER'", "'LYNCHING - ATTEMPTED'",
       'LYNCHING', "'TRAIN WRECKING'", "'LETTERS", "'SEX"], dtype=object)

Retrieves and displays all the unique values in the Crm_Cd_Desc column of the DataFrame df, which represent the unique descriptions of crime categories in the dataset.
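Notice that many values carry stray single quotes left over from the ARFF quoting (e.g. "'BURGLARY FROM VEHICLE'"). If desired, these could be stripped with pandas string methods; a minimal sketch on made-up values:

```python
import pandas as pd

s = pd.Series(["'BURGLARY FROM VEHICLE'", "ROBBERY", "'THEFT OF IDENTITY'"])

# Strip leading/trailing single quotes; unquoted values are unaffected.
cleaned = s.str.strip("'")
print(cleaned.tolist())  # ['BURGLARY FROM VEHICLE', 'ROBBERY', 'THEFT OF IDENTITY']
```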

In [12]:
df.drop(columns = ['Weapon_Used_Cd', 'Weapon_Desc', 'Crm_Cd_1', 'Crm_Cd_2', 'Crm_Cd_3', 'Crm_Cd_4', 'Cross_Street'], inplace = True, axis = 1)

df['Vict_Descent'] = df['Vict_Descent'].fillna('None')
df['Vict_Sex'] = df['Vict_Sex'].fillna('None')
df['Mocodes'] = df['Mocodes'].fillna('none')
df['Premis_Desc'] = df['Premis_Desc'].fillna('None')

df['Date_Rptd'] = pd.to_datetime(df['Date_Rptd'].str[:11])
df['Date_Occ'] = pd.to_datetime(df['Date_Occ'].str[:11])

df.head()
Out[12]:
DR_NO Date_Rptd Date_Occ Time_Occ Area Area_Name Rpt_Dist_No Part_1_2 Crm_Cd Crm_Cd_Desc ... Vict_Age Vict_Sex Vict_Descent Premis_Cd Premis_Desc Status Status_Desc Location Lat Lon
0 200106753 2020-02-09 2020-02-08 1800 1 Central 182 1 330 'BURGLARY FROM VEHICLE' ... 47 M O 128.0 'BUS STOP/LAYOVER (ALSO QUERY 124)' IC 'Invest Cont' '1000 S FLOWER ST' 34.0444 -118.2628
1 200907217 2023-05-10 2020-03-10 2037 9 'Van Nuys' 964 1 343 'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)' ... 19 M O 405.0 'CLOTHING STORE' IC 'Invest Cont' '14000 RIVERSIDE DR' 34.1576 -118.4387
2 220614831 2022-08-18 2020-08-17 1200 6 Hollywood 666 2 354 'THEFT OF IDENTITY' ... 28 M H 102.0 SIDEWALK IC 'Invest Cont' '1900 TRANSIENT' 34.0944 -118.3277
3 231808869 2023-04-04 2020-12-01 2300 18 Southeast 1826 2 354 'THEFT OF IDENTITY' ... 41 M H 501.0 'SINGLE FAMILY DWELLING' IC 'Invest Cont' '9900 COMPTON AV' 33.9467 -118.2463
4 220314085 2022-07-22 2020-05-12 1110 3 Southwest 303 2 354 'THEFT OF IDENTITY' ... 27 F B 248.0 'CELL PHONE STORE' IC 'Invest Cont' '2500 S SYCAMORE AV' 34.0335 -118.3537

5 rows × 21 columns

Cleans the DataFrame by dropping unnecessary columns, filling missing values in several categorical columns with 'None', converting the date columns to datetime, and previewing the cleaned data.
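The date conversion works by slicing off the time portion before parsing. A small self-contained sketch of the same idea, using sample date strings in this dataset's format (and an explicit format string, which is faster and safer than letting pandas guess):

```python
import pandas as pd

dates = pd.Series(["02/09/2020 12:00:00 AM", "11/11/2020 12:00:00 AM"])

# Keep only the date part (first 10 characters) and parse it explicitly.
parsed = pd.to_datetime(dates.str[:10], format="%m/%d/%Y")
print(parsed.dt.year.tolist())  # [2020, 2020]
```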

In [13]:
df.isnull().sum()
Out[13]:
DR_NO           0
Date_Rptd       0
Date_Occ        0
Time_Occ        0
Area            0
Area_Name       0
Rpt_Dist_No     0
Part_1_2        0
Crm_Cd          0
Crm_Cd_Desc     0
Mocodes         0
Vict_Age        0
Vict_Sex        0
Vict_Descent    0
Premis_Cd       0
Premis_Desc     0
Status          0
Status_Desc     0
Location        0
Lat             0
Lon             0
dtype: int64

Checks for missing values in the DataFrame df using df.isnull().sum(), which outputs the count of missing values per column. Every column reports 0, but this is slightly misleading: the file encodes missing entries as the literal string '?', which pandas does not treat as null, so real gaps only surface after the numeric conversions in the next cell.
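Because this file marks missing entries with the literal string '?' (the ARFF convention), df.isnull() cannot see them. One option, sketched here on made-up rows, is to declare the marker at load time via na_values:

```python
import io
import pandas as pd

raw = "330,M,?\n354,?,H\n"

# Treat the ARFF missing-value marker '?' as NaN while reading.
df_na = pd.read_csv(io.StringIO(raw), header=None, na_values="?")
print(df_na.isnull().sum().tolist())  # [0, 1, 1]
```

Had the real file been loaded with na_values='?', the null counts above would reveal the true missingness directly.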

In [15]:
df_cleaned = df.dropna().copy()  # explicit copy avoids SettingWithCopyWarning on the assignments below

df_cleaned['Vict_Age'] = pd.to_numeric(df_cleaned['Vict_Age'], errors='coerce').astype('Int64')
df_cleaned['Lat'] = pd.to_numeric(df_cleaned['Lat'], errors='coerce')
df_cleaned['Lon'] = pd.to_numeric(df_cleaned['Lon'], errors='coerce')

df_cleaned['Vict_Sex'] = df_cleaned['Vict_Sex'].astype('category')
df_cleaned['Vict_Descent'] = df_cleaned['Vict_Descent'].astype('category')

print(df_cleaned.isnull().sum())
print(df_cleaned.info())
DR_NO               0
Date_Rptd           0
Date_Occ            0
Time_Occ            0
Area                0
Area_Name           0
Rpt_Dist_No         0
Part_1_2            0
Crm_Cd              0
Crm_Cd_Desc         0
Mocodes             0
Vict_Age         9550
Vict_Sex            0
Vict_Descent        0
Premis_Cd           0
Premis_Desc         0
Status              0
Status_Desc         0
Location            0
Lat             14420
Lon              2185
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533557 entries, 0 to 533556
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   DR_NO         533557 non-null  int64         
 1   Date_Rptd     533557 non-null  datetime64[ns]
 2   Date_Occ      533557 non-null  datetime64[ns]
 3   Time_Occ      533557 non-null  int64         
 4   Area          533557 non-null  int64         
 5   Area_Name     533557 non-null  object        
 6   Rpt_Dist_No   533557 non-null  int64         
 7   Part_1_2      533557 non-null  int64         
 8   Crm_Cd        533557 non-null  int64         
 9   Crm_Cd_Desc   533557 non-null  object        
 10  Mocodes       533557 non-null  object        
 11  Vict_Age      524007 non-null  Int64         
 12  Vict_Sex      533557 non-null  category      
 13  Vict_Descent  533557 non-null  category      
 14  Premis_Cd     533557 non-null  object        
 15  Premis_Desc   533557 non-null  object        
 16  Status        533557 non-null  object        
 17  Status_Desc   533557 non-null  object        
 18  Location      533557 non-null  object        
 19  Lat           519137 non-null  float64       
 20  Lon           531372 non-null  float64       
dtypes: Int64(1), category(2), datetime64[ns](2), float64(2), int64(6), object(8)
memory usage: 79.4+ MB
None

Cleans the dataset by dropping null rows, coercing Vict_Age to the nullable Int64 dtype and Lat/Lon to floats, and storing Vict_Sex and Vict_Descent as memory-efficient categorical dtypes. Note that the coercion reintroduces missing values: entries that cannot be parsed as numbers (such as the 'None' placeholders) become NaN, which is why Vict_Age, Lat, and Lon now show non-zero null counts in the output above.
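The key mechanism here is errors='coerce': unparseable entries become NaN instead of raising, and the nullable Int64 extension dtype can hold integers and missing values side by side. A small sketch on toy data:

```python
import pandas as pd

ages = pd.Series(["47", "19", "None", "28"])

# errors='coerce' turns unparseable strings into NaN instead of raising
numeric = pd.to_numeric(ages, errors="coerce")

# Plain int64 cannot store NaN; the nullable Int64 extension dtype can
as_int = numeric.astype("Int64")
print(as_int.isna().sum())  # 1
```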

In [16]:
df_cleaned = df_cleaned.dropna(subset=['Lat', 'Lon', 'Vict_Age'])
# Verify the cleaned DataFrame
print(df_cleaned.isnull().sum())
print(df_cleaned.info())
DR_NO           0
Date_Rptd       0
Date_Occ        0
Time_Occ        0
Area            0
Area_Name       0
Rpt_Dist_No     0
Part_1_2        0
Crm_Cd          0
Crm_Cd_Desc     0
Mocodes         0
Vict_Age        0
Vict_Sex        0
Vict_Descent    0
Premis_Cd       0
Premis_Desc     0
Status          0
Status_Desc     0
Location        0
Lat             0
Lon             0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 519136 entries, 0 to 533556
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   DR_NO         519136 non-null  int64         
 1   Date_Rptd     519136 non-null  datetime64[ns]
 2   Date_Occ      519136 non-null  datetime64[ns]
 3   Time_Occ      519136 non-null  int64         
 4   Area          519136 non-null  int64         
 5   Area_Name     519136 non-null  object        
 6   Rpt_Dist_No   519136 non-null  int64         
 7   Part_1_2      519136 non-null  int64         
 8   Crm_Cd        519136 non-null  int64         
 9   Crm_Cd_Desc   519136 non-null  object        
 10  Mocodes       519136 non-null  object        
 11  Vict_Age      519136 non-null  Int64         
 12  Vict_Sex      519136 non-null  category      
 13  Vict_Descent  519136 non-null  category      
 14  Premis_Cd     519136 non-null  object        
 15  Premis_Desc   519136 non-null  object        
 16  Status        519136 non-null  object        
 17  Status_Desc   519136 non-null  object        
 18  Location      519136 non-null  object        
 19  Lat           519136 non-null  float64       
 20  Lon           519136 non-null  float64       
dtypes: Int64(1), category(2), datetime64[ns](2), float64(2), int64(6), object(8)
memory usage: 81.2+ MB
None

Removes rows with missing values in the Lat, Lon, and Vict_Age columns from the df_cleaned DataFrame, then verifies the result by printing the per-column null counts and the DataFrame's structure via df_cleaned.info().
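The subset argument restricts the null check to the named columns only; rows with nulls elsewhere would survive. A toy illustration:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"Lat": [34.0, np.nan, 33.9],
                    "Lon": [-118.2, -118.3, np.nan],
                    "Status": ["IC", "IC", "IC"]})

# Only rows with a null in Lat or Lon are dropped; other columns are ignored
cleaned = toy.dropna(subset=["Lat", "Lon"])
print(len(cleaned))  # 1
```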

EDA¶

Step 1: Exploratory Data Analysis¶

In [22]:
eda_results = {
    "Crime Type Frequency": df['Crm_Cd_Desc'].value_counts().head(10),
    "Area Crime Count": df['Area_Name'].value_counts(),
    "Victim Age Statistics": df['Vict_Age'].describe(),
    "Crimes by Time of Day": df['Time_Occ'].value_counts(bins=4).sort_index(),
    "Top Premises for Crimes": df['Premis_Desc'].value_counts().head(10)
}

# Prepare for time-series analysis
df['Year_Month'] = df['Date_Occ'].dt.to_period('M')
crimes_by_month = df.groupby('Year_Month').size()
crimes_by_month
Out[22]:
Year_Month
2020-01    10044
2020-02     9252
2020-03     8867
2020-04     8772
2020-05     9687
2020-06     9397
2020-07     9301
2020-08     8802
2020-09     8190
2020-10     8889
2020-11     8603
2020-12     8977
2021-01     9794
2021-02     9171
2021-03     9602
2021-04     9298
2021-05     9701
2021-06     9602
2021-07    10328
2021-08    10222
2021-09    10477
2021-10    11161
2021-11    10850
2021-12    10818
2022-01    10484
2022-02    10086
2022-03    11172
2022-04    11214
2022-05    11552
2022-06    11248
2022-07    11110
2022-08    11404
2022-09    10858
2022-10    11361
2022-11    10835
2022-12    11690
2023-01    11935
2023-02    10872
2023-03    11146
2023-04    10899
2023-05    10733
2023-06    10619
2023-07    11287
2023-08    11585
2023-09    11024
2023-10    11633
2023-11    11268
2023-12    11552
2024-01    11937
2024-02    10749
2024-03    10084
2024-04     3415
Freq: M, dtype: int64

Performed EDA by summarizing crime frequencies, victim age statistics, and crime timings while preparing the dataset for time-series analysis by grouping crimes by month.
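The month-level counts rely on dt.to_period('M'), which truncates each timestamp to its calendar month so groupby can aggregate by month. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({"Date_Occ": pd.to_datetime(
    ["2020-01-05", "2020-01-20", "2020-02-03"])})

# to_period('M') collapses each date to its calendar month
toy["Year_Month"] = toy["Date_Occ"].dt.to_period("M")
counts = toy.groupby("Year_Month").size()
print(counts.tolist())  # [2, 1]
```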

In [23]:
eda_results
Out[23]:
{'Crime Type Frequency': Crm_Cd_Desc
 'VEHICLE - STOLEN'                                       100157
 'BURGLARY FROM VEHICLE'                                   54784
 BURGLARY                                                  47246
 'THEFT OF IDENTITY'                                       42839
 'THEFT PLAIN - PETTY ($950 & UNDER)'                      35574
 'THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)'         35050
 'THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)'     31752
 'SHOPLIFTING - PETTY THEFT ($950 & UNDER)'                23182
 ROBBERY                                                   17190
 'VANDALISM - MISDEAMEANOR ($399 OR UNDER)'                14468
 Name: count, dtype: int64,
 'Area Crime Count': Area_Name
 Central          33578
 Pacific          33206
 '77th Street'    31469
 Wilshire         27885
 'N Hollywood'    27783
 Southwest        27020
 Newton           26984
 'West LA'        26039
 Hollywood        25816
 Northeast        25662
 Southeast        25431
 Devonshire       24734
 Olympic          23803
 'Van Nuys'       23491
 'West Valley'    23398
 Topanga          23031
 Harbor           21714
 Mission          21468
 Rampart          21413
 Hollenbeck       20713
 Foothill         18919
 Name: count, dtype: int64,
 'Victim Age Statistics': count     533557
 unique      6741
 top            0
 freq      168871
 Name: Vict_Age, dtype: int64,
 'Crimes by Time of Day': (-1.359, 590.5]      80687
 (590.5, 1180.0]     109172
 (1180.0, 1769.5]    172345
 (1769.5, 2359.0]    171353
 Name: count, dtype: int64,
 'Top Premises for Crimes': Premis_Desc
 STREET                            178786
 'SINGLE FAMILY DWELLING'           96667
 'PARKING LOT'                      46842
 'OTHER BUSINESS'                   26576
 GARAGE/CARPORT                     15602
 SIDEWALK                           14434
 DRIVEWAY                           11586
 'DEPARTMENT STORE'                 10939
 'RESTAURANT/FAST FOOD'              6860
 'PARKING UNDERGROUND/BUILDING'      6778
 Name: count, dtype: int64}

The eda_results dictionary collects key insights from the dataset: the top 10 crime types, crime counts by area, victim age statistics, crime distributions by time of day, and the top premises for crimes. Note that Vict_Age is still non-numeric at this point, so describe() reports count/unique/top/freq rather than mean and quartiles, and the most frequent value is 0, a likely placeholder for unknown ages.

Graph 1: Top 10 Crime Types¶
In [28]:
plt.figure(figsize=(10, 10))
df['Crm_Cd_Desc'].value_counts().head(10).plot(kind='bar', title="Top 10 Crime Types")
plt.xlabel('Crime Type')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
[Figure: Top 10 Crime Types bar chart]

The bar chart illustrates the top 10 most frequent crime types, with "Vehicle - Stolen" leading significantly, surpassing 100,000 reported incidents. This highlights vehicle theft as a prominent issue in the dataset's coverage area. Following this, crimes like "Burglary from Vehicle", "Burglary", and "Theft of Identity" also show high frequencies, emphasizing a pattern of property-related offenses and vulnerabilities in vehicle and property security.

Petty theft-related crimes, including "Theft Plain - Petty ($950 & Under)", "Theft from Motor Vehicle", and "Shoplifting - Petty Theft", are also prevalent, reflecting opportunistic behaviors targeting easily accessible items. Less frequent but still notable offenses, such as "Robbery" and "Vandalism - Misdemeanor ($399 or Under)", further underscore the dominance of property crimes in the area, suggesting a need for focused preventive measures.

Graph 2: Crime Count by Area¶
In [32]:
plt.figure(figsize=(15, 10))
df['Area_Name'].value_counts().plot(kind='bar', title="Crime Count by Area")
plt.xlabel('Area Name')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
[Figure: Crime Count by Area bar chart]

The bar chart visualizes the number of crimes reported in different areas, with area names on the x-axis and crime counts on the y-axis. "Central" and "Pacific" have the highest crime counts, each with over 33,000 incidents, making them hotspots for criminal activity. They are followed closely by "77th Street", "Wilshire", and "N Hollywood", which also report notably high counts.

Other areas, such as "Southwest", "Newton", and "West LA", report moderately high crime counts, while "Foothill" and "Hollenbeck" have the lowest counts among the 21 areas. This distribution suggests that certain regions experience a disproportionate share of crime, highlighting the need for targeted law enforcement and community safety initiatives in these high-crime areas.

Graph 3: Victim Age Distribution¶
In [33]:
df['Vict_Age'] = pd.to_numeric(df['Vict_Age'], errors='coerce')
df['Vict_Age'] = df['Vict_Age'].fillna(df['Vict_Age'].median())
df['Lat'] = pd.to_numeric(df['Lat'], errors='coerce')
df['Lon'] = pd.to_numeric(df['Lon'], errors='coerce')
df = df.dropna(subset=['Lat', 'Lon'])  
df = df[(df['Vict_Age'] > 0) & (df['Vict_Age'] <= 100)]
print(df['Vict_Age'].describe())
df.reset_index(drop=True, inplace=True)
 

plt.figure(figsize=(10, 6))
df['Vict_Age'].plot(kind='hist', bins=20, title="Victim Age Distribution", color='blue')
plt.xlabel('Victim Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
count    326863.000000
mean         40.862783
std          15.545000
min           2.000000
25%          29.000000
50%          38.000000
75%          51.000000
max          99.000000
Name: Vict_Age, dtype: float64
[Figure: Victim Age Distribution histogram]

The histogram visualizes the age distribution of crime victims, with preprocessing steps including converting Vict_Age to numeric values, filling missing ages with the median, and filtering to ages greater than 0 and at most 100 (ages recorded as 0 appear to be placeholders rather than real victims). This ensures a clean and accurate representation of the data.

The histogram reveals that most crime victims are between 20 and 40 years old, peaking around 30, indicating young adults are the most affected group. Victim frequency declines steadily beyond 40 and drops significantly after 60, suggesting lower victimization rates among older individuals. These findings underscore the need for targeted safety measures for young adults, who are at a higher risk of crime.

Graph 4: Crimes by Time of Day¶
In [35]:
time_bins = [0, 600, 1200, 1800, 2400]
time_labels = ['Midnight to Morning', 'Morning to Noon', 'Noon to Evening', 'Evening to Midnight']
df['Time_Binned'] = pd.cut(df['Time_Occ'], bins=time_bins, labels=time_labels, right=False)

# Plot the cleaned Time of Day distribution
plt.figure(figsize=(10, 6))
df['Time_Binned'].value_counts().sort_index().plot(kind='bar', title="Crimes by Time of Day (Cleaned)")
plt.xlabel('Time of Day')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
[Figure: Crimes by Time of Day bar chart]

The bar chart visualizes the distribution of crimes across four time intervals: Midnight to Morning, Morning to Noon, Noon to Evening, and Evening to Midnight. The data reveals that most crimes occur Noon to Evening, followed by Evening to Midnight, indicating a higher crime rate during the latter part of the day.

In contrast, fewer crimes are reported Morning to Noon, with the lowest frequency occurring Midnight to Morning. This trend suggests that criminal activity peaks in the afternoon and evening hours, tapering off during the early morning, potentially reflecting variations in daily routines and societal activity levels.
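The four intervals come from pd.cut with right=False, which makes each bin left-closed ([0, 600), [600, 1200), and so on), so a time of exactly 1200 falls in the Noon-to-Evening bin. A small sketch with hypothetical HHMM values:

```python
import pandas as pd

times = pd.Series([30, 615, 1200, 1830, 2359])  # HHMM-style integers
bins = [0, 600, 1200, 1800, 2400]
labels = ["Midnight to Morning", "Morning to Noon",
          "Noon to Evening", "Evening to Midnight"]

# right=False -> left-closed bins, so 1200 lands in [1200, 1800)
binned = pd.cut(times, bins=bins, labels=labels, right=False)
print(binned.value_counts().sort_index().tolist())  # [1, 1, 1, 2]
```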

Graph 5: Crimes by Year and Month¶
In [36]:
# Remove the last data point (potentially incomplete month/year) from the time series
filtered_crimes_by_month = crimes_by_month.iloc[:-1]

# Plot the filtered Crimes by Year and Month
plt.figure(figsize=(12, 6))
filtered_crimes_by_month.plot(kind='line', title="Crimes by Year and Month (Filtered)")
plt.xlabel('Year-Month')
plt.ylabel('Number of Crimes')
plt.grid()
plt.tight_layout()
plt.show()
[Figure: Crimes by Year and Month line chart]

The line chart visualizes the monthly trend of reported crimes from January 2020 through early 2024, excluding the final incomplete month to avoid skew from partial data. The data reveals a clear upward trajectory in crime counts over the period, punctuated by occasional dips and spikes, suggesting periods of varying criminal activity.

These fluctuations hint at potential seasonality or specific factors influencing crime rates, while the overall increase underscores a growing concern. This trend highlights the importance of sustained efforts to address and mitigate criminal activity in the region.

In [40]:
# Create a sparse matrix for area and crime type
area_crime_matrix = pd.crosstab(df['Area_Name'], df['Crm_Cd_Desc'])
sparse_matrix = csr_matrix(area_crime_matrix.values)

# Calculate additional metrics
metrics = {
    "Total Records": len(df),
    "Total Unique Crime Types": df['Crm_Cd_Desc'].nunique(),
    "Total Unique Areas": df['Area_Name'].nunique(),
    "Missing Values": df.isnull().sum().sum(),
    "Density of Sparse Matrix": (sparse_matrix.nnz / np.prod(sparse_matrix.shape)),
}


# Sparse Matrix Dimensions
sparse_matrix_shape = sparse_matrix.shape


metrics_output = {
    "Total Records": metrics["Total Records"],
    "Total Unique Crime Types": metrics["Total Unique Crime Types"],
    "Total Unique Areas": metrics["Total Unique Areas"],
    "Missing Values": metrics["Missing Values"],
    "Density of Sparse Matrix": metrics["Density of Sparse Matrix"],
    "Sparse Matrix Shape": sparse_matrix_shape,
}

This code creates a sparse matrix to analyze the relationship between areas and crime types, calculates metrics, and outputs key dataset statistics:

  1. Sparse Matrix Creation:

    • A crosstabulation is created using pd.crosstab to map Area_Name (rows) to Crm_Cd_Desc (columns), showing the frequency of each crime type in each area.
    • The resulting matrix is converted into a sparse matrix format using csr_matrix for efficient storage.
  2. Metrics Calculation:

    • Total Records: The number of rows in the dataset.
    • Total Unique Crime Types: The number of distinct crime types.
    • Total Unique Areas: The number of unique areas.
    • Missing Values: The total number of missing values in the dataset.
    • Density of Sparse Matrix: The ratio of non-zero elements to the total elements in the sparse matrix, indicating how "dense" the matrix is.
  3. Output:

    • Outputs the calculated metrics and the dimensions of the sparse matrix for further analysis or reporting.

This step summarizes the dataset's structure and provides a compressed representation of the area-crime relationships, useful for efficient data manipulation and machine learning applications.
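On toy data (not the actual crime DataFrame), the crosstab-to-CSR pipeline and the density metric look like this, assuming scipy is available as in the notebook:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "Area_Name":   ["Central", "Central", "Pacific", "Pacific"],
    "Crm_Cd_Desc": ["ROBBERY", "BURGLARY", "ROBBERY", "ROBBERY"],
})

# Rows = areas, columns = crime types, cells = frequencies
matrix = pd.crosstab(toy["Area_Name"], toy["Crm_Cd_Desc"])
sparse = csr_matrix(matrix.values)

# Density = non-zero cells / total cells
density = sparse.nnz / np.prod(sparse.shape)
print(density)  # 0.75 (3 of 4 cells are non-zero)
```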

In [41]:
area_crime_matrix
Out[41]:
Crm_Cd_Desc 'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER' 'ATTEMPTED ROBBERY' 'BATTERY - SIMPLE ASSAULT' 'BATTERY ON A FIREFIGHTER' 'BATTERY POLICE (SIMPLE)' 'BATTERY WITH SEXUAL CONTACT' 'BIKE - ATTEMPTED STOLEN' 'BIKE - STOLEN' 'BLOCKING DOOR INDUCTION CENTER' 'BOMB SCARE' ... COUNTERFEIT EXTORTION KIDNAPPING PANDERING PICKPOCKET PIMPING PROWLER ROBBERY STALKING TRESPASSING
Area_Name
'77th Street' 8 237 319 0 5 2 0 24 0 0 ... 0 76 11 3 13 42 4 1579 31 159
'N Hollywood' 2 71 248 1 0 3 0 322 0 0 ... 1 72 7 0 31 0 7 416 8 391
'Van Nuys' 1 62 189 1 2 2 0 203 0 11 ... 4 62 6 8 22 8 1 417 14 322
'West LA' 0 46 249 1 1 4 1 831 0 1 ... 2 58 1 1 89 4 13 212 12 460
'West Valley' 1 67 187 0 3 3 0 164 4 2 ... 2 92 7 0 25 1 61 367 9 415
Central 41 205 807 7 13 5 0 673 0 17 ... 0 22 15 0 599 3 0 1139 11 299
Devonshire 1 60 192 2 2 2 0 112 0 0 ... 0 89 7 0 37 1 7 249 14 371
Foothill 2 58 173 0 3 2 0 38 0 5 ... 4 86 16 0 5 0 0 333 14 218
Harbor 0 76 183 0 0 0 0 89 0 1 ... 1 59 7 0 19 0 3 342 11 185
Hollenbeck 12 79 212 1 2 4 0 48 0 1 ... 0 61 7 0 25 0 0 465 7 124
Hollywood 7 111 397 1 7 18 0 318 0 3 ... 1 55 17 3 585 15 1 758 38 413
Mission 11 92 140 0 3 1 0 79 0 8 ... 4 107 8 0 8 0 0 439 24 269
Newton 3 186 370 3 0 1 0 55 0 0 ... 0 62 10 0 120 0 0 1098 8 109
Northeast 2 87 221 1 6 2 2 256 0 1 ... 2 74 3 0 166 0 9 360 26 298
Olympic 5 139 352 0 1 7 0 250 0 2 ... 2 48 12 1 190 10 2 701 17 146
Pacific 13 70 295 2 2 12 1 1124 0 17 ... 0 62 3 1 102 6 16 364 9 289
Rampart 2 177 365 0 2 2 0 182 0 1 ... 1 22 10 0 111 2 1 789 11 139
Southeast 8 170 229 1 0 0 0 22 0 2 ... 1 74 20 7 7 9 0 1140 35 155
Southwest 47 147 376 3 15 2 0 740 1 13 ... 3 101 5 2 255 0 10 927 32 413
Topanga 25 54 201 0 2 1 0 111 0 9 ... 13 89 2 0 39 0 8 416 13 422
Wilshire 19 96 254 1 11 2 0 289 0 9 ... 7 57 11 0 212 0 4 673 28 560

21 rows × 107 columns

The area_crime_matrix presents a crosstabulation of Area_Name (rows) and Crm_Cd_Desc (columns), detailing the frequency of each crime type in different areas. Each cell indicates how often a specific crime occurred in a given area, offering a granular view of crime distribution.

Key findings reveal that areas like Central, Wilshire, and Pacific exhibit higher counts across multiple crime types, marking them as crime hotspots. Conversely, certain areas report low or zero occurrences for specific crimes, highlighting regional variations in crime patterns. This matrix serves as a valuable tool for targeted interventions and area-specific crime analysis.

In [42]:
metrics
Out[42]:
{'Total Records': 326863,
 'Total Unique Crime Types': 107,
 'Total Unique Areas': 21,
 'Missing Values': 0,
 'Density of Sparse Matrix': 0.7427681352914998}

The metrics output summarizes the dataset with 326,863 records, 107 crime types, 21 areas, no missing values, and a sparse matrix density of 74%.

In [43]:
sparse_matrix_shape
Out[43]:
(21, 107)

The sparse_matrix_shape output shows that the sparse matrix has 21 rows (areas) and 107 columns (crime types).

In [44]:
metrics_output
Out[44]:
{'Total Records': 326863,
 'Total Unique Crime Types': 107,
 'Total Unique Areas': 21,
 'Missing Values': 0,
 'Density of Sparse Matrix': 0.7427681352914998,
 'Sparse Matrix Shape': (21, 107)}

The metrics_output summarizes the dataset with 326,863 records, 107 crime types, 21 areas, no missing values, a sparse matrix density of 74.28%, and dimensions of (21, 107).

In [46]:
# Removing extra quotes if any
df['Area_Name'] = df['Area_Name'].str.replace("'", "")
df['Crm_Cd_Desc'] = df['Crm_Cd_Desc'].str.replace("'", "")

# Create a sparse matrix (Area vs. Crime Type)
area_crime_matrix = pd.crosstab(df['Area_Name'], df['Crm_Cd_Desc'])
sparse_matrix = csr_matrix(area_crime_matrix.values)

# Plot the sparse matrix as a heatmap
plt.figure(figsize=(12, 8))
plt.imshow(area_crime_matrix.values, cmap="YlGnBu", aspect="auto")
plt.colorbar(label="Crime Count")
plt.xticks(range(area_crime_matrix.columns.size), area_crime_matrix.columns, rotation=90, fontsize=8)
plt.yticks(range(area_crime_matrix.index.size), area_crime_matrix.index, fontsize=10)
plt.title("Area vs Crime Type (Heatmap)", fontsize=14)
plt.xlabel("Crime Type", fontsize=12)
plt.ylabel("Area", fontsize=12)
plt.tight_layout()
plt.show()
[Figure: Area vs Crime Type heatmap]

This heatmap visualizes the relationship between areas and crime types, with preprocessing steps including the removal of extra quotes from Area_Name and Crm_Cd_Desc and the creation of a sparse matrix where rows represent areas, columns represent crime types, and values indicate crime counts.

The heatmap uses the YlGnBu color scheme, with darker shades signifying higher crime counts. It highlights areas like Central and Wilshire, which show higher activity across multiple crime types. Most crimes are sparsely distributed, with a few types dominating specific areas. This visualization effectively identifies patterns and hotspots, aiding targeted analysis and intervention strategies.

In [47]:
# Identify the top 10 crime types
top_10_crime_types = df['Crm_Cd_Desc'].value_counts().head(10).index

# Filter the area-crime matrix for the top 10 crime types
filtered_area_crime_matrix = area_crime_matrix[top_10_crime_types]

# Plot the filtered matrix as a heatmap
plt.figure(figsize=(12, 8))
plt.imshow(filtered_area_crime_matrix.values, cmap="YlGnBu", aspect="auto")
plt.colorbar(label="Crime Count")
plt.xticks(range(filtered_area_crime_matrix.columns.size), filtered_area_crime_matrix.columns, rotation=45, ha="right")
plt.yticks(range(filtered_area_crime_matrix.index.size), filtered_area_crime_matrix.index)
plt.title("Top 10 Crime Types by Area (Heatmap)", fontsize=14)
plt.xlabel("Crime Type", fontsize=12)
plt.ylabel("Area", fontsize=12)
plt.tight_layout()
plt.show()
[Figure: Top 10 Crime Types by Area heatmap]

This heatmap visualizes the distribution of the top 10 most frequent crime types across different areas, focusing on high-frequency crimes. The data was filtered to include only the top 10 crime types, creating a focused representation of key patterns. The x-axis represents these crime types, while the y-axis represents various areas, with darker shades in the YlGnBu color scheme indicating higher crime counts.

Key insights reveal that areas like Central, Wilshire, and 77th Street exhibit heightened activity across multiple crime types, particularly Burglary from Vehicle and Theft of Identity. In contrast, crimes such as Robbery and Vandalism appear more localized to specific areas. This visualization highlights crime hotspots for the most common offenses, offering valuable insights for targeted prevention and intervention strategies.

4. Model¶

Analysis based on Hypothesis¶

Relationship Between Crime Type and Area¶
  • Hypothesis: Specific crime types are concentrated in certain areas. For instance, vehicle-related crimes might be more common in high traffic or urban areas.
  • Reasoning: The heatmap suggests certain crime types have hotspots in specific areas.

The dataset contains 23 columns with over 500,000 rows, including the following key attributes relevant to the hypothesis:

  • Crm_Cd_Desc: Describes the type of crime.
  • Area_Name: Provides the name of the area where the crime occurred.
  • Lat and Lon: Coordinates for geographical analysis.
  • Premis_Desc: Description of the location of the crime.
  • Date_Occ and Time_Occ: Provide date and time of occurrence.

To explore the relationship between crime types and areas, we will focus on Crm_Cd_Desc and Area_Name and analyze their distribution. We will also visualize potential hotspots using heatmaps or similar methods.

Let’s start by examining the most frequent crime types per area.

In [49]:
# Grouping data by Area_Name and Crm_Cd_Desc to find the most common crimes in each area
crime_area_group = (
    df.groupby(['Area_Name', 'Crm_Cd_Desc'])
    .size()
    .reset_index(name='Count')
)

# Finding the most frequent crime type per area
most_frequent_crimes_per_area = (
    crime_area_group.loc[crime_area_group.groupby('Area_Name')['Count'].idxmax()]
    .sort_values(by='Count', ascending=False)
)

# Display the results
most_frequent_crimes_per_area
Out[49]:
Area_Name Crm_Cd_Desc Count
94 Central BURGLARY FROM VEHICLE 8117
480 Hollywood BURGLARY FROM VEHICLE 3707
69 77th Street THEFT OF IDENTITY 3449
950 Pacific BURGLARY FROM VEHICLE 3147
1162 Southeast THEFT OF IDENTITY 3064
1439 West LA BURGLARY FROM VEHICLE 3004
637 N Hollywood BURGLARY FROM VEHICLE 2968
870 Olympic BURGLARY FROM VEHICLE 2852
1600 Wilshire BURGLARY FROM VEHICLE 2807
794 Northeast BURGLARY FROM VEHICLE 2672
227 Devonshire THEFT OF IDENTITY 2543
1354 Van Nuys BURGLARY FROM VEHICLE 2468
1271 Topanga BURGLARY 2457
712 Newton BURGLARY FROM VEHICLE 2408
1519 West Valley BURGLARY FROM VEHICLE 2319
1246 Southwest THEFT OF IDENTITY 2291
1036 Rampart BURGLARY FROM VEHICLE 2211
305 Foothill THEFT OF IDENTITY 2096
615 Mission THEFT OF IDENTITY 2032
454 Hollenbeck THEFT OF IDENTITY 1692
375 Harbor THEFT OF IDENTITY 1353

Identifies the most frequent crime type in each area by grouping the dataset by Area_Name and Crm_Cd_Desc, calculating counts, and filtering for the most common crime per area. The analysis highlights distinct crime patterns across regions.

Findings:¶
  1. "Burglary from Vehicle" is most common in areas like Central, Hollywood, and Pacific, with Central reporting the highest count (8,117 incidents).
  2. "Theft of Identity" dominates areas such as 77th Street, Southeast, and Devonshire.
  3. Topanga reports "Burglary" as the most frequent crime, showing regional variation.
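The idxmax-within-group pattern used above generalizes: for each group, idxmax returns the index label of the row with the largest Count, and loc retrieves those rows. A toy sketch:

```python
import pandas as pd

counts = pd.DataFrame({
    "Area_Name":   ["Central", "Central", "Pacific", "Pacific"],
    "Crm_Cd_Desc": ["ROBBERY", "BURGLARY FROM VEHICLE",
                    "ROBBERY", "BURGLARY FROM VEHICLE"],
    "Count": [5, 9, 7, 3],
})

# idxmax picks, per area, the index of the row holding the largest Count
top = counts.loc[counts.groupby("Area_Name")["Count"].idxmax()]
print(top["Crm_Cd_Desc"].tolist())
# ['BURGLARY FROM VEHICLE', 'ROBBERY']
```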

1. Overall Crime Type Distribution¶

In [51]:
# Plot the overall crime type distribution
crime_type_counts = df['Crm_Cd_Desc'].value_counts().head(10)
# Retry plotting the overall crime type distribution
crime_type_counts.plot(kind='bar')
plt.title('Top 10 Crime Types')
plt.ylabel('Count')
plt.xlabel('Crime Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
[Figure: Top 10 Crime Types bar chart]

The bar chart visualizes the top 10 most frequent crime types, emphasizing the dominance of property-related offenses, particularly those involving vehicles and theft.

Findings:¶
  1. "Burglary from Vehicle" leads significantly with over 40,000 incidents, making it the most common crime.

  2. Crimes like "Theft of Identity", "Burglary", and "Theft from Motor Vehicle - Grand ($950.01 and Over)" also show high prevalence.

  3. Less frequent crimes, including "Robbery", "Vandalism - Misdemeanor ($399 or Under)", and "Brandish Weapon", still feature prominently in the dataset.

2. Top Crime Types by Area¶

In [54]:
# Group data by Area_Name and Crm_Cd_Desc to find the most common crimes in each area
crime_area_group = (
    df.groupby(['Area_Name', 'Crm_Cd_Desc'])
    .size()
    .reset_index(name='Count')
)

# Find the most frequent crime type per area
most_frequent_crimes_per_area = (
    crime_area_group.loc[crime_area_group.groupby('Area_Name')['Count'].idxmax()]
    .sort_values(by='Count', ascending=False)
)

# Get the top 10 areas with the highest count of a specific crime type
top_crimes_by_area = most_frequent_crimes_per_area.head(10)
plt.barh(top_crimes_by_area['Area_Name'], top_crimes_by_area['Count'])
plt.xlabel('Count')
plt.ylabel('Area Name')
plt.title('Top Crime Types by Area')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


# Display the results for analysis
most_frequent_crimes_per_area.head(10)
[Figure: Top Crime Types by Area bar chart]
Out[54]:
Area_Name Crm_Cd_Desc Count
94 Central BURGLARY FROM VEHICLE 8117
480 Hollywood BURGLARY FROM VEHICLE 3707
69 77th Street THEFT OF IDENTITY 3449
950 Pacific BURGLARY FROM VEHICLE 3147
1162 Southeast THEFT OF IDENTITY 3064
1439 West LA BURGLARY FROM VEHICLE 3004
637 N Hollywood BURGLARY FROM VEHICLE 2968
870 Olympic BURGLARY FROM VEHICLE 2852
1600 Wilshire BURGLARY FROM VEHICLE 2807
794 Northeast BURGLARY FROM VEHICLE 2672

The table highlights the most frequent crime types in each area, revealing distinct patterns of geographic concentration and dominance of certain offenses.

Findings:¶
  1. "Burglary from Vehicle" is the leading crime in areas like Central (8,117 incidents), Hollywood, and Pacific.
  2. "Theft of Identity" is most common in areas such as 77th Street, Southeast, and Devonshire.
  3. West LA and North Hollywood also report high occurrences of "Burglary from Vehicle", underscoring its prevalence.

3. Temporal Analysis: Analyze crime trends over time¶

In [55]:
# Remove the last month in the dataset for temporal analysis
df['Date_Occ'] = pd.to_datetime(df['Date_Occ'], errors='coerce')
latest_month = df['Date_Occ'].max().month
latest_year = df['Date_Occ'].max().year

# Filter out the last month and create a copy to avoid warnings
filtered_data = df[
    ~((df['Date_Occ'].dt.month == latest_month) & (df['Date_Occ'].dt.year == latest_year))
].copy()  # Use .copy() here to ensure it's a new DataFrame

# Extract year and month for temporal analysis
filtered_data['Year'] = filtered_data['Date_Occ'].dt.year
filtered_data['Month'] = filtered_data['Date_Occ'].dt.month

# Group data by Year and Month for crime trends
temporal_trends_filtered = (
    filtered_data.groupby(['Year', 'Month'])
    .size()
    .reset_index(name='Crime_Count')
    .sort_values(by=['Year', 'Month'])
)

# Plotting the temporal trends without the last month
plt.figure(figsize=(14, 8))
plt.plot(
    temporal_trends_filtered['Year'].astype(str) + '-' + temporal_trends_filtered['Month'].astype(str),
    temporal_trends_filtered['Crime_Count'],
    marker='o'
)
plt.title('Crime Trends Over Time')
plt.xlabel('Time (Year-Month)')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
[Figure: Crime Trends Over Time line chart]

This analysis examines temporal trends in crime by grouping incidents by year and month, excluding the latest incomplete month to ensure accurate insights.

Findings:¶
  1. Crime counts generally increase over the analyzed period, peaking mid-way before showing a decline toward the end.
  2. Fluctuations in the trends suggest possible seasonal or external factors influencing criminal activity.

These findings highlight temporal patterns, aiding in better resource allocation and intervention planning.
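Excluding the trailing partial month is done with a boolean mask built from the month and year of the latest record; a toy version:

```python
import pandas as pd

toy = pd.DataFrame({"Date_Occ": pd.to_datetime(
    ["2024-03-05", "2024-03-20", "2024-04-02"])})

latest = toy["Date_Occ"].max()

# True for rows falling in the latest (possibly incomplete) month
mask = ((toy["Date_Occ"].dt.month == latest.month) &
        (toy["Date_Occ"].dt.year == latest.year))
kept = toy[~mask]
print(len(kept))  # 2
```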

4. Premises Analysis: Study relationships between crime types and locations.¶

In [56]:
# Group data by Premis_Desc and Crm_Cd_Desc to find the most common crime types at each location type
premises_crime_group = (
    df.groupby(['Premis_Desc', 'Crm_Cd_Desc'])
    .size()
    .reset_index(name='Count')
    .sort_values(by='Count', ascending=False)
)

# Get the top 10 premises with the most frequent crimes
top_premises_crimes = premises_crime_group.head(10)

# Plot the top premises for crimes
plt.barh(top_premises_crimes['Premis_Desc'], top_premises_crimes['Count'])
plt.xlabel('Number of Crimes')
plt.ylabel('Premises Description')
plt.title('Top Premises for Crimes')
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.tight_layout()
plt.show()

This analysis identifies the most common premises where crimes occur, focusing on top locations and their frequency through data grouping and visualization.

Key Findings:¶

  1. Single Family Dwelling: The most frequent crime location, with over 25,000 incidents, emphasizing residential areas as significant crime sites.
  2. Street: The second most common location, highlighting public spaces as key areas of concern.
  3. Parking Lot: The third most frequent site, pointing to potential security issues in these areas.

These findings underscore the need for targeted safety measures in both residential and public spaces to address crime effectively.
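A complementary view of the premises/crime relationship is the single most frequent crime type at each premises, which `groupby` plus `idxmax` gives directly. A minimal sketch on toy records (the column names follow the notebook; the values are illustrative):

```python
import pandas as pd

# Toy records with the notebook's column names (values are illustrative)
toy = pd.DataFrame({
    'Premis_Desc': ['STREET', 'STREET', 'STREET',
                    'PARKING LOT', 'PARKING LOT', 'PARKING LOT'],
    'Crm_Cd_Desc': ['THEFT', 'THEFT', 'ASSAULT',
                    'BURGLARY FROM VEHICLE', 'BURGLARY FROM VEHICLE', 'THEFT'],
})

# Count each (premises, crime) pair, then keep the most frequent crime per premises
pair_counts = toy.groupby(['Premis_Desc', 'Crm_Cd_Desc']).size().reset_index(name='Count')
top_per_premises = pair_counts.loc[pair_counts.groupby('Premis_Desc')['Count'].idxmax()]
print(top_per_premises)
```

Applied to `df`, the same pattern would answer "what crime is most typical at each location type" rather than just "which pairs are most common overall".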

ML Analysis¶

GeoSpatial Analysis¶

In [57]:
df.head()
Out[57]:
DR_NO Date_Rptd Date_Occ Time_Occ Area Area_Name Rpt_Dist_No Part_1_2 Crm_Cd Crm_Cd_Desc ... Vict_Descent Premis_Cd Premis_Desc Status Status_Desc Location Lat Lon Year_Month Time_Binned
0 200106753 2020-02-09 2020-02-08 1800 1 Central 182 1 330 BURGLARY FROM VEHICLE ... O 128.0 'BUS STOP/LAYOVER (ALSO QUERY 124)' IC 'Invest Cont' '1000 S FLOWER ST' 34.0444 -118.2628 2020-02 Evening to Midnight
1 200907217 2023-05-10 2020-03-10 2037 9 Van Nuys 964 1 343 SHOPLIFTING-GRAND THEFT ($950.01 & OVER) ... O 405.0 'CLOTHING STORE' IC 'Invest Cont' '14000 RIVERSIDE DR' 34.1576 -118.4387 2020-03 Evening to Midnight
2 220614831 2022-08-18 2020-08-17 1200 6 Hollywood 666 2 354 THEFT OF IDENTITY ... H 102.0 SIDEWALK IC 'Invest Cont' '1900 TRANSIENT' 34.0944 -118.3277 2020-08 Noon to Evening
3 231808869 2023-04-04 2020-12-01 2300 18 Southeast 1826 2 354 THEFT OF IDENTITY ... H 501.0 'SINGLE FAMILY DWELLING' IC 'Invest Cont' '9900 COMPTON AV' 33.9467 -118.2463 2020-12 Evening to Midnight
4 220314085 2022-07-22 2020-05-12 1110 3 Southwest 303 2 354 THEFT OF IDENTITY ... B 248.0 'CELL PHONE STORE' IC 'Invest Cont' '2500 S SYCAMORE AV' 34.0335 -118.3537 2020-05 Morning to Noon

5 rows × 23 columns

In [58]:
# Create a geometry column from LAT/LON coordinates
geometry = [Point(lon, lat) for lon, lat in zip(df_cleaned['Lon'], df_cleaned['Lat'])]

# Create a GeoDataFrame
gdf = gpd.GeoDataFrame(df_cleaned, geometry=geometry)

# Set the coordinate reference system (CRS) to WGS84
gdf.set_crs(epsg=4326, inplace=True)

# Display the first few rows of the GeoDataFrame
gdf.head()
Out[58]:
DR_NO Date_Rptd Date_Occ Time_Occ Area Area_Name Rpt_Dist_No Part_1_2 Crm_Cd Crm_Cd_Desc ... Vict_Sex Vict_Descent Premis_Cd Premis_Desc Status Status_Desc Location Lat Lon geometry
0 200106753 2020-02-09 2020-02-08 1800 1 Central 182 1 330 'BURGLARY FROM VEHICLE' ... M O 128.0 'BUS STOP/LAYOVER (ALSO QUERY 124)' IC 'Invest Cont' '1000 S FLOWER ST' 34.0444 -118.2628 POINT (-118.2628 34.0444)
1 200907217 2023-05-10 2020-03-10 2037 9 'Van Nuys' 964 1 343 'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)' ... M O 405.0 'CLOTHING STORE' IC 'Invest Cont' '14000 RIVERSIDE DR' 34.1576 -118.4387 POINT (-118.4387 34.1576)
2 220614831 2022-08-18 2020-08-17 1200 6 Hollywood 666 2 354 'THEFT OF IDENTITY' ... M H 102.0 SIDEWALK IC 'Invest Cont' '1900 TRANSIENT' 34.0944 -118.3277 POINT (-118.3277 34.0944)
3 231808869 2023-04-04 2020-12-01 2300 18 Southeast 1826 2 354 'THEFT OF IDENTITY' ... M H 501.0 'SINGLE FAMILY DWELLING' IC 'Invest Cont' '9900 COMPTON AV' 33.9467 -118.2463 POINT (-118.2463 33.9467)
4 220314085 2022-07-22 2020-05-12 1110 3 Southwest 303 2 354 'THEFT OF IDENTITY' ... F B 248.0 'CELL PHONE STORE' IC 'Invest Cont' '2500 S SYCAMORE AV' 34.0335 -118.3537 POINT (-118.3537 34.0335)

5 rows × 22 columns

Converts the cleaned dataset into a geospatial format for mapping and spatial analysis.

Steps:¶
  1. Create Geometry Column:

    • Combines latitude (Lat) and longitude (Lon) coordinates into Point objects for each record using the shapely.geometry.Point class.
  2. Create GeoDataFrame:

    • Converts the df_cleaned DataFrame into a GeoDataFrame (gdf) using geopandas.GeoDataFrame, incorporating the geometry column.
  3. Set Coordinate Reference System (CRS):

    • Sets the CRS to WGS84 (EPSG:4326), a standard for geographic coordinates, enabling accurate mapping and geospatial analysis.
  4. Preview GeoDataFrame:

    • Displays the first 5 rows of the GeoDataFrame, which now includes a geometry column for spatial representation.

Purpose:¶

This prepares the dataset for geospatial analysis, allowing crimes to be visualized on maps and enabling spatial queries to identify trends or hotspots.
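Even before any GeoPandas machinery, the Lat/Lon columns already support simple spatial queries such as a bounding-box filter. A minimal sketch on toy coordinates (the box edges and point values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy crime points with the notebook's Lat/Lon columns (values illustrative)
pts = pd.DataFrame({
    'Lat': [34.05, 34.20, 33.80],
    'Lon': [-118.25, -118.45, -118.10],
})

# Keep only points inside an illustrative downtown bounding box
in_box = pts[pts['Lat'].between(34.00, 34.10) & pts['Lon'].between(-118.30, -118.20)]
print(len(in_box))
```

The GeoDataFrame generalizes this to arbitrary polygons (e.g. neighborhood boundaries) via spatial joins.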

In [59]:
# Create a map centered around the mean latitude and longitude of the crime locations
map_center = [df_cleaned['Lat'].mean(), df_cleaned['Lon'].mean()]

# Prepare data for HeatMap (LAT/LON coordinates)
heat_data = df_cleaned[['Lat', 'Lon']].values.tolist()

# Create the HeatMap
heatmap = folium.Map(location=map_center, zoom_start=12)
HeatMap(heat_data).add_to(heatmap)

# Display the heatmap
heatmap
Out[59]:

The heatmap visualizes crime density, with red areas indicating hotspots of high activity, primarily in central and urban regions.

Insights:¶
  1. High crime concentrations are visible in central areas and densely populated urban zones.
  2. Peripheral areas show significantly lower crime density.
  3. This visualization highlights where law enforcement and public safety measures should be prioritized.
In [61]:
# Create the figure and axis
fig, ax = plt.subplots(figsize=(20, 20))

# Plot the GeoDataFrame on a Matplotlib axis
gdf.plot(ax=ax, color='red', markersize=1)

# Set axis limits
ax.set_xlim(-118.8, -118)
ax.set_ylim(33.7, 34.35)

# Set labels and title
ax.set_title('Crime Locations')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')

# Show the plot
plt.show()

This scatter plot visualizes individual crime locations across the region using their latitude and longitude coordinates.

Insights:¶
  1. The points form a detailed outline of the mapped area, indicating widespread crime occurrences.
  2. Densely packed clusters represent urban areas with higher crime activity.
  3. Sparse points highlight regions with lower crime occurrences, likely less populated or rural.

This visualization provides a comprehensive spatial overview of crime distribution, aiding in identifying high and low-crime regions.

Predictive Modeling (Crime Prevention) using Random Forest Classifier and ANN.¶

Random Forest Classifier¶

In [62]:
# Encode categorical variables
df['Vict_Sex'] = df['Vict_Sex'].astype('category').cat.codes
df['Vict_Descent'] = df['Vict_Descent'].astype('category').cat.codes
df['Crm_Cd'] = df['Crm_Cd'].astype('category').cat.codes

# Create target variable (e.g., crime type or severity)
df['Target'] = df['Part_1_2'].astype('category').cat.codes 

The cell encodes the categorical variables (Vict_Sex, Vict_Descent, Crm_Cd, and Part_1_2) into numerical codes for machine learning, and derives the target variable Target from Part_1_2, which distinguishes Part I from Part II crimes.
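The behavior of `.cat.codes` can be seen on a toy column: each distinct category maps to an integer (assigned in sorted order), and missing values become -1. The values below are illustrative:

```python
import pandas as pd

# A toy Vict_Sex-style column: categories code in sorted order, missing becomes -1
s = pd.Series(['M', 'F', 'M', None])
codes = s.astype('category').cat.codes
print(codes.tolist())  # [1, 0, 1, -1]
```

Note that these codes are ordinal to the model even though the underlying categories are not, which is one reason tree-based models like Random Forest tolerate this encoding better than linear models do.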

In [63]:
# Select relevant features for modeling
features = ['Lat', 'Lon', 'Vict_Age', 'Vict_Sex', 'Vict_Descent', 'Crm_Cd']
X = df[features]
y = df['Target']

The code selects relevant features (Lat, Lon, Vict_Age, Vict_Sex, Vict_Descent, Crm_Cd) for modeling as X and defines the target variable y as Target.

In [64]:
# Initialize K-Fold cross-validator
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

This code initializes a stratified K-Fold cross-validator with 10 splits, shuffling the data to ensure randomness and preserving the class distribution using a random seed (random_state=64).
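The stratification guarantee can be demonstrated on a small imbalanced toy problem (the 80/20 split below is illustrative): every test fold preserves the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 80 of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)
fold_positives = [int((y_demo[test_idx] == 1).sum())
                  for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_positives)  # every 10-sample test fold keeps exactly 2 positives
```

A plain (unstratified) KFold could produce folds with zero positives, which would distort per-fold accuracy on imbalanced data like crime categories.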

In [65]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=50, max_depth=10, max_features=2, random_state=64)

# Perform cross-validation
cv_scores = cross_val_score(rf_model, X, y, cv=kfold, scoring='accuracy')

#Cross-Validation scores
print("10-Fold Cross-Validation Accuracy Scores: \n", cv_scores)
print("\n Mean CV Accuracy: \n", np.mean(cv_scores))
print("\n Standard Deviation of CV Accuracy: \n", np.std(cv_scores))

# Train and test on the whole dataset for a single split as an example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=64)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model on the test set
print("\nRandom Forest Classifier Performance on Test Set:")
print(classification_report(y_test, y_pred_rf))
print("Test Set Accuracy:", accuracy_score(y_test, y_pred_rf))
10-Fold Cross-Validation Accuracy Scores: 
 [0.92434026 0.92434026 0.92430966 0.92421771 0.92348345 0.92088295
 0.92562504 0.92158661 0.92351404 0.92544147]

 Mean CV Accuracy: 
 0.923774144923883

 Standard Deviation of CV Accuracy: 
 0.0014362023158888083

Random Forest Classifier Performance on Test Set:
              precision    recall  f1-score   support

           0       0.90      0.89      0.93     85851
           1       0.94      0.95      0.91     44895

    accuracy                           0.92    130746
   macro avg       0.92      0.91      0.90    130746
weighted avg       0.92      0.92      0.92    130746

Test Set Accuracy: 0.9223362856225048
In [67]:
print(confusion_matrix(y_test, y_pred_rf))
[[76406   9445]
 [  2245 42650]]

The Random Forest Classifier is configured with 50 trees, a maximum depth of 10, and two features considered at each split. Using the stratified folds defined earlier, it runs 10-fold cross-validation, computes the accuracy for each fold, and reports the mean accuracy along with its standard deviation. As a worked example of training and testing, the dataset is then split into training (60%) and test (40%) subsets, the model is fit on the training set, and it is evaluated on the test set. The evaluation includes a classification report and the overall test accuracy, giving a picture of the model's robustness and performance.

With a mean 10-fold cross-validation accuracy of 92.38% and a low standard deviation of 0.14%, the Random Forest Classifier performs well and consistently across folds. On the test set it reaches 92.23% accuracy, with balanced precision and recall for both classes. The high f1-scores (0.93 for class 0 and 0.91 for class 1) show the model handles both classes capably, and the weighted average metrics confirm strong overall performance, indicating the model is reliable for predicting the target variable in this dataset.
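A natural follow-up, not run in the original notebook, is to inspect which inputs drive the forest's predictions via its `feature_importances_` attribute. A minimal sketch on synthetic data (`rf_demo`, `X_demo`, and `y_demo` are illustrative names; on the real data this would be called on the fitted `rf_model`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data in which only the first feature carries signal
rng = np.random.default_rng(64)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=64)
rf_demo.fit(X_demo, y_demo)

# Importances sum to 1; the informative feature should dominate
for name, imp in zip(['f0', 'f1', 'f2'], rf_demo.feature_importances_):
    print(f'{name}: {imp:.3f}')
```

On the crime data, this would indicate whether location (Lat/Lon) or victim attributes contribute more to separating Part I from Part II crimes.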

Sequential Model (Artificial Neural Network) - Dense Layers¶

In [68]:
# Normalize the features for better performance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Normalize features using StandardScaler to improve model performance and convergence.
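Standardization rescales each column to zero mean and unit variance, which keeps gradient-based training well conditioned. A quick check on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# After scaling, each column has zero mean and unit variance
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
X_std = StandardScaler().fit_transform(X_demo)
print(round(float(X_std.mean()), 6), round(float(X_std.std()), 6))
```

Note the scaler is fit on the full feature matrix here, as in the notebook; a stricter protocol would fit it on each training fold only to avoid leaking test-set statistics.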

In [71]:
# K-Fold Cross-Validation
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=64)

# Store metrics for each fold
fold_accuracies = []
fold_reports = []

Perform stratified K-fold cross-validation and record the metrics for each fold.

In [72]:
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_scaled, y)):
    print(f"Training Fold {fold + 1}...")
    
    # Split data into train and test for this fold
    X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Define the ANN model
    model = Sequential([
        Input(shape=(X_train.shape[1],)),  # Specify the input shape here
        Dense(units=64, activation='relu'),
        BatchNormalization(),
        Dropout(0.2),
        Dense(units=32, activation='relu'),
        BatchNormalization(),
        Dropout(0.3),
        Dense(units=16, activation='relu'),
        Dropout(0.3),
        Dense(units=1, activation='sigmoid')  # Output layer for binary classification
    ])

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Early stopping
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    # Train the model
    model.fit(X_train, y_train,
              validation_split=0.2,
              epochs=10,
              batch_size=64,
              callbacks=[early_stopping],
              verbose=1)

    # Evaluate the model on the fold test set
    y_pred_prob = model.predict(X_test)
    y_pred = (y_pred_prob > 0.5).astype(int).flatten()

    # Calculate accuracy for this fold
    accuracy = accuracy_score(y_test, y_pred)
    fold_accuracies.append(accuracy)

    # Store classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    fold_reports.append(report)

    print(f"Fold {fold + 1} Accuracy: {accuracy:.2f}")

# Display overall performance
mean_accuracy = np.mean(fold_accuracies)
std_accuracy = np.std(fold_accuracies)

print("\nK-Fold Cross-Validation Results:")
print(f"Mean Accuracy: {mean_accuracy:.2f}")
print(f"Standard Deviation of Accuracy: {std_accuracy:.2f}")
Training Fold 1...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 26s 5ms/step - accuracy: 0.8033 - loss: 0.3829 - val_accuracy: 0.8311 - val_loss: 0.1623
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.8102 - loss: 0.1982 - val_accuracy: 0.8242 - val_loss: 0.0818
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 5ms/step - accuracy: 0.8434 - loss: 0.1400 - val_accuracy: 0.8592 - val_loss: 0.0501
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.8584 - loss: 0.1124 - val_accuracy: 0.8385 - val_loss: 0.0433
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 5ms/step - accuracy: 0.8682 - loss: 0.0996 - val_accuracy: 0.8696 - val_loss: 0.0415
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 5ms/step - accuracy: 0.8512 - loss: 0.0936 - val_accuracy: 0.8693 - val_loss: 0.0394
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - accuracy: 0.8618 - loss: 0.0922 - val_accuracy: 0.8697 - val_loss: 0.0372
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 17s 5ms/step - accuracy: 0.8728 - loss: 0.0904 - val_accuracy: 0.8747 - val_loss: 0.0379
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - accuracy: 0.8643 - loss: 0.0874 - val_accuracy: 0.8672 - val_loss: 0.0397
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - accuracy: 0.8712 - loss: 0.0861 - val_accuracy: 0.8759 - val_loss: 0.0355
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
Fold 1 Accuracy: 0.87
Training Fold 2...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 26s 5ms/step - accuracy: 0.8029 - loss: 0.3911 - val_accuracy: 0.8118 - val_loss: 0.1656
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 17s 4ms/step - accuracy: 0.8599 - loss: 0.2124 - val_accuracy: 0.8653 - val_loss: 0.0862
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 22s 5ms/step - accuracy: 0.8378 - loss: 0.1483 - val_accuracy: 0.8686 - val_loss: 0.0508
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 17s 4ms/step - accuracy: 0.8562 - loss: 0.1159 - val_accuracy: 0.8761 - val_loss: 0.0468
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 5ms/step - accuracy: 0.8617 - loss: 0.1055 - val_accuracy: 0.8727 - val_loss: 0.0466
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 22s 5ms/step - accuracy: 0.8643 - loss: 0.1017 - val_accuracy: 0.8754 - val_loss: 0.0426
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.8660 - loss: 0.0975 - val_accuracy: 0.8700 - val_loss: 0.0409
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 5ms/step - accuracy: 0.8669 - loss: 0.0964 - val_accuracy: 0.8702 - val_loss: 0.0359
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 22s 5ms/step - accuracy: 0.8678 - loss: 0.0939 - val_accuracy: 0.8718 - val_loss: 0.0372
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 5ms/step - accuracy: 0.8693 - loss: 0.0889 - val_accuracy: 0.8764 - val_loss: 0.0366
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step
Fold 2 Accuracy: 0.87
Training Fold 3...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 31s 6ms/step - accuracy: 0.7958 - loss: 0.4025 - val_accuracy: 0.8174 - val_loss: 0.1676
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 24s 7ms/step - accuracy: 0.8587 - loss: 0.2153 - val_accuracy: 0.8668 - val_loss: 0.0966
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - accuracy: 0.8351 - loss: 0.1542 - val_accuracy: 0.8575 - val_loss: 0.0599
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 24s 6ms/step - accuracy: 0.8524 - loss: 0.1234 - val_accuracy: 0.8675 - val_loss: 0.0491
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - accuracy: 0.8590 - loss: 0.1115 - val_accuracy: 0.8790 - val_loss: 0.0463
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - accuracy: 0.8615 - loss: 0.1092 - val_accuracy: 0.8768 - val_loss: 0.0476
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 17s 5ms/step - accuracy: 0.8650 - loss: 0.1019 - val_accuracy: 0.8754 - val_loss: 0.0509
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.8646 - loss: 0.1027 - val_accuracy: 0.8794 - val_loss: 0.0420
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.8667 - loss: 0.0959 - val_accuracy: 0.8693 - val_loss: 0.0381
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 26s 7ms/step - accuracy: 0.8677 - loss: 0.0937 - val_accuracy: 0.8790 - val_loss: 0.0383
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step
Fold 3 Accuracy: 0.87
Training Fold 4...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 20s 4ms/step - accuracy: 0.7828 - loss: 0.4175 - val_accuracy: 0.8791 - val_loss: 0.1822
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.8546 - loss: 0.2270 - val_accuracy: 0.8721 - val_loss: 0.1170
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8307 - loss: 0.1633 - val_accuracy: 0.8709 - val_loss: 0.0621
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8573 - loss: 0.1277 - val_accuracy: 0.8773 - val_loss: 0.0509
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8618 - loss: 0.1122 - val_accuracy: 0.8734 - val_loss: 0.0515
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8628 - loss: 0.1050 - val_accuracy: 0.8735 - val_loss: 0.0505
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.8643 - loss: 0.1010 - val_accuracy: 0.8767 - val_loss: 0.0441
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8678 - loss: 0.0955 - val_accuracy: 0.8760 - val_loss: 0.0446
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 20s 3ms/step - accuracy: 0.8689 - loss: 0.0952 - val_accuracy: 0.8764 - val_loss: 0.0452
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8706 - loss: 0.0885 - val_accuracy: 0.8783 - val_loss: 0.0364
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step
Fold 4 Accuracy: 0.87
Training Fold 5...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 3ms/step - accuracy: 0.8006 - loss: 0.3741 - val_accuracy: 0.8042 - val_loss: 0.1687
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8033 - loss: 0.2065 - val_accuracy: 0.8497 - val_loss: 0.1006
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8369 - loss: 0.1506 - val_accuracy: 0.8695 - val_loss: 0.0797
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8506 - loss: 0.1294 - val_accuracy: 0.8662 - val_loss: 0.0746
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 21s 4ms/step - accuracy: 0.8601 - loss: 0.1123 - val_accuracy: 0.8743 - val_loss: 0.0544
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8666 - loss: 0.1019 - val_accuracy: 0.8781 - val_loss: 0.0447
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8698 - loss: 0.0949 - val_accuracy: 0.8768 - val_loss: 0.0448
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8712 - loss: 0.0912 - val_accuracy: 0.8794 - val_loss: 0.0404
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8734 - loss: 0.0854 - val_accuracy: 0.8763 - val_loss: 0.0383
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8735 - loss: 0.0866 - val_accuracy: 0.8769 - val_loss: 0.0368
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step
Fold 5 Accuracy: 0.87
Training Fold 6...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 3ms/step - accuracy: 0.8072 - loss: 0.3790 - val_accuracy: 0.8295 - val_loss: 0.1568
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8096 - loss: 0.1973 - val_accuracy: 0.8733 - val_loss: 0.0650
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8463 - loss: 0.1351 - val_accuracy: 0.8776 - val_loss: 0.0460
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8603 - loss: 0.1102 - val_accuracy: 0.8754 - val_loss: 0.0474
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8636 - loss: 0.0999 - val_accuracy: 0.8701 - val_loss: 0.0410
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8670 - loss: 0.0936 - val_accuracy: 0.8702 - val_loss: 0.0378
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.8696 - loss: 0.0863 - val_accuracy: 0.8701 - val_loss: 0.0362
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8699 - loss: 0.0873 - val_accuracy: 0.8709 - val_loss: 0.0373
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.8719 - loss: 0.0841 - val_accuracy: 0.8773 - val_loss: 0.0397
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.8722 - loss: 0.0812 - val_accuracy: 0.8745 - val_loss: 0.0342
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
Fold 6 Accuracy: 0.87
Training Fold 7...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 22s 5ms/step - accuracy: 0.8029 - loss: 0.3773 - val_accuracy: 0.8185 - val_loss: 0.1651
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 15s 4ms/step - accuracy: 0.8074 - loss: 0.1983 - val_accuracy: 0.8782 - val_loss: 0.0826
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 14s 4ms/step - accuracy: 0.8444 - loss: 0.1404 - val_accuracy: 0.8782 - val_loss: 0.0742
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8642 - loss: 0.1070 - val_accuracy: 0.8755 - val_loss: 0.0418
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8700 - loss: 0.0937 - val_accuracy: 0.8749 - val_loss: 0.0388
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.8731 - loss: 0.0876 - val_accuracy: 0.8796 - val_loss: 0.0367
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.8734 - loss: 0.0861 - val_accuracy: 0.8797 - val_loss: 0.0385
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 21s 6ms/step - accuracy: 0.8730 - loss: 0.0882 - val_accuracy: 0.8796 - val_loss: 0.0401
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - accuracy: 0.8765 - loss: 0.0826 - val_accuracy: 0.8793 - val_loss: 0.0369
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 14s 3ms/step - accuracy: 0.8755 - loss: 0.0816 - val_accuracy: 0.8797 - val_loss: 0.0371
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step
Fold 7 Accuracy: 0.87
Training Fold 8...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 18s 4ms/step - accuracy: 0.7974 - loss: 0.3898 - val_accuracy: 0.8356 - val_loss: 0.1557
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step - accuracy: 0.8110 - loss: 0.1977 - val_accuracy: 0.8792 - val_loss: 0.0837
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8473 - loss: 0.1382 - val_accuracy: 0.8764 - val_loss: 0.0499
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8657 - loss: 0.1052 - val_accuracy: 0.8766 - val_loss: 0.0485
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8713 - loss: 0.0936 - val_accuracy: 0.8783 - val_loss: 0.0416
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 21s 3ms/step - accuracy: 0.8736 - loss: 0.0884 - val_accuracy: 0.8775 - val_loss: 0.0452
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.8747 - loss: 0.0842 - val_accuracy: 0.8760 - val_loss: 0.0557
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8741 - loss: 0.0842 - val_accuracy: 0.8782 - val_loss: 0.0354
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8751 - loss: 0.0815 - val_accuracy: 0.8794 - val_loss: 0.0371
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 4ms/step - accuracy: 0.8766 - loss: 0.0782 - val_accuracy: 0.8781 - val_loss: 0.0339
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
Fold 8 Accuracy: 0.87
Training Fold 9...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 3ms/step - accuracy: 0.8121 - loss: 0.3628 - val_accuracy: 0.8229 - val_loss: 0.1646
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8034 - loss: 0.2000 - val_accuracy: 0.8434 - val_loss: 0.0708
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8424 - loss: 0.1388 - val_accuracy: 0.8529 - val_loss: 0.0604
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8597 - loss: 0.1189 - val_accuracy: 0.8779 - val_loss: 0.0428
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 21s 3ms/step - accuracy: 0.8647 - loss: 0.1054 - val_accuracy: 0.8794 - val_loss: 0.0409
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8663 - loss: 0.0974 - val_accuracy: 0.8793 - val_loss: 0.0409
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8688 - loss: 0.0926 - val_accuracy: 0.8796 - val_loss: 0.0384
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8708 - loss: 0.0899 - val_accuracy: 0.8793 - val_loss: 0.0394
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8717 - loss: 0.0872 - val_accuracy: 0.8789 - val_loss: 0.0385
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8719 - loss: 0.0871 - val_accuracy: 0.8772 - val_loss: 0.0359
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
Fold 9 Accuracy: 0.87
Training Fold 10...
Epoch 1/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 16s 4ms/step - accuracy: 0.8065 - loss: 0.3794 - val_accuracy: 0.8321 - val_loss: 0.1768
Epoch 2/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8058 - loss: 0.2027 - val_accuracy: 0.8733 - val_loss: 0.0900
Epoch 3/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8383 - loss: 0.1473 - val_accuracy: 0.8770 - val_loss: 0.0708
Epoch 4/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8558 - loss: 0.1221 - val_accuracy: 0.8697 - val_loss: 0.0508
Epoch 5/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8651 - loss: 0.1041 - val_accuracy: 0.8709 - val_loss: 0.0462
Epoch 6/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 13s 3ms/step - accuracy: 0.8681 - loss: 0.0981 - val_accuracy: 0.8781 - val_loss: 0.0419
Epoch 7/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8700 - loss: 0.0952 - val_accuracy: 0.8792 - val_loss: 0.0427
Epoch 8/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8728 - loss: 0.0884 - val_accuracy: 0.8775 - val_loss: 0.0396
Epoch 9/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8726 - loss: 0.0880 - val_accuracy: 0.8798 - val_loss: 0.0382
Epoch 10/10
3678/3678 ━━━━━━━━━━━━━━━━━━━━ 12s 3ms/step - accuracy: 0.8738 - loss: 0.0857 - val_accuracy: 0.8762 - val_loss: 0.0364
1022/1022 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step
Fold 10 Accuracy: 0.87

K-Fold Cross-Validation Results:
Mean Accuracy: 0.87
Standard Deviation of Accuracy: 0.03
In [74]:
print(confusion_matrix(y_test, y_pred))
[[18651   2647]
 [  1616 9483]]
In [75]:
model.summary()
Model: "sequential_10"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_40 (Dense)                     │ (None, 64)                  │             448 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_20               │ (None, 64)                  │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_30 (Dropout)                 │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_41 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_21               │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_31 (Dropout)                 │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_42 (Dense)                     │ (None, 16)                  │             528 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_32 (Dropout)                 │ (None, 16)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_43 (Dense)                     │ (None, 1)                   │              17 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 9,989 (39.02 KB)
 Trainable params: 3,265 (12.75 KB)
 Non-trainable params: 192 (768.00 B)
 Optimizer params: 6,532 (25.52 KB)
In [76]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.88      0.92      0.90     21298
           1       0.85      0.78      0.82     11099

    accuracy                           0.87     32686
   macro avg       0.87      0.85      0.86     32686
weighted avg       0.87      0.87      0.87     32686

The K-Fold cross-validation procedure assesses the artificial neural network's (ANN) binary-classification performance. The dataset is divided into ten stratified folds using StratifiedKFold, preserving the class distribution in each fold. For every fold, the data is split into training and test sets and a sequential ANN is defined and trained. The network consists of several dense layers with ReLU activations, batch normalization for stability, dropout for regularization, and a sigmoid activation in the output layer for binary classification. The model is compiled with the Adam optimizer and binary cross-entropy loss, and early stopping is used to avoid overfitting. Each fold's accuracy and classification report are recorded, giving a view of model performance across folds.

Evaluating the ANN with K-Fold cross-validation yielded a mean accuracy of 87% with a standard deviation of 0.03, indicating that the model performed consistently across folds. The confusion matrix further illustrates its behavior: 18,651 true negatives and 9,483 true positives were classified correctly, against 2,647 false positives and 1,616 false negatives. These results suggest that while the model distinguishes the two classes well, there is room to reduce misclassifications, particularly false positives. Overall, the ANN generalizes well to this dataset while maintaining dependable predictive accuracy.

5. Interpretation of Results¶

Understanding the patterns that drive crime distribution is critical for making sound judgments about resource allocation and public safety measures. Our study aimed to investigate the theory that particular types of crimes are concentrated in specific places, such as vehicle-related crimes being more common in high-traffic urban areas. We tested the efficacy of machine learning models such as Random Forest and Artificial Neural Networks (ANN) to predict crime types based on geographical and contextual variables.

Random Forest Results:¶

The Random Forest model performed strongly on this task, with a mean cross-validation accuracy of 92.38% and a small standard deviation of 0.14%, indicating consistent performance across folds. On the test set, the model achieved 92.23% accuracy and a weighted F1-score of 0.92. The confusion matrix showed 76,406 true negatives, 42,650 true positives, 9,445 false positives, and 2,245 false negatives, demonstrating its ability to distinguish the crime categories. However, Type I errors (false positives) remain somewhat elevated, indicating room for improvement, perhaps through feature engineering or more thorough hyperparameter optimization.
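One concrete route to the hyperparameter optimization mentioned above is a cross-validated grid search. The sketch below uses synthetic stand-in data, and the grid values are illustrative assumptions, not the tutorial's tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in; the real pipeline would pass the crime feature matrix.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Illustrative search space -- these values are assumptions.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}

# Score with weighted F1, matching the metric reported for the test set.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1_weighted", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring the search on weighted F1 rather than plain accuracy keeps the tuning aligned with the imbalance between crime categories.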

Artificial Neural Network (ANN) Results:¶

The ANN model posted somewhat lower metrics, with a cross-validation mean accuracy of 87% and a standard deviation of 0.03, indicating reasonable reliability. On the test set, the ANN achieved 87.23% accuracy with a weighted F1-score of 0.87. Its confusion matrix showed 18,651 true negatives, 9,483 true positives, 2,647 false positives, and 1,616 false negatives. Compared with Random Forest, the ANN suffered more Type II errors (false negatives), underperforming at identifying true positive cases. This underscores the ANN's reliance on larger datasets and the potential for improvement by adding features or modifying the architecture.

Analysis and Key Takeaways:¶

Both models' results support the hypothesis, revealing patterns in crime distribution that correspond to spatial trends in the data. The Random Forest model proved the more robust and reliable option for this investigation, surpassing the ANN on most measures. However, both models showed limitations in handling false positives and false negatives, suggesting that additional contextual features, such as traffic patterns, population density, or time of day, could improve prediction accuracy.

Future Improvements:¶

The findings show that, while our models capture the overall patterns, there is still room for improvement. Enriching the dataset with factors such as weather, socioeconomic indicators, or proximity to key landmarks may yield more detailed insights. Experimenting with advanced ensemble techniques or hybrid systems that combine Random Forest and an ANN could also help reduce errors. Iterating on these findings by refining models and features will give a clearer view of crime distribution patterns and predictions that track real-world conditions more closely.
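One simple form of the hybrid system suggested above is a soft-voting ensemble that averages the predicted class probabilities of a Random Forest and a neural network. A minimal sketch on synthetic data, with model sizes chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the crime features and binary labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Soft voting averages each model's predicted probabilities before deciding.
hybrid = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("ann", MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                              random_state=1)),
    ],
    voting="soft",
)
hybrid.fit(X_train, y_train)
acc = accuracy_score(y_test, hybrid.predict(X_test))
print("hybrid accuracy:", round(acc, 3))
```

Because the two model families tend to make different kinds of errors (the Random Forest leaned toward false positives, the ANN toward false negatives), averaging their probabilities is a plausible way to cancel some of each.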

6. Conclusion¶

My models did not achieve the level of robustness I had hoped for, but that is a normal part of the data science process. Regardless, the journey has been extremely valuable. I was able to find patterns in crime data, process and clean it efficiently, test my hypothesis about crime-type concentration, and forecast the likelihood of specific crimes in specific places. While the current results are not strong enough to inform major policy decisions, they do suggest areas for improvement and provide a solid framework for future research.

This project allowed me to work through the key stages of the data science lifecycle in a practical context:

  1. Data Collection: Gathering detailed crime records from 2020 onward to analyze spatial and contextual factors.
  2. Data Processing: Cleaning and preparing the data to ensure consistency and relevance for my models.
  3. Exploratory Analysis and Visualization: Using heatmaps and other tools to uncover trends, such as hotspots for vehicle-related crimes.
  4. Model Analysis and Testing: Training and evaluating Random Forest and ANN models, understanding their strengths and limitations.
  5. Interpretation of Results: Drawing insights from the models, such as the need for additional features to improve predictions.

This exploration reaffirmed my belief that data science is iterative: results frequently lead to new questions and opportunities to improve methodologies. My Random Forest model, for example, performed well with 92% accuracy, but adding features such as traffic patterns or socioeconomic data could lead to even better results. The ANN model, despite its lower accuracy of 87%, highlighted areas where architectural changes or additional data could improve performance.

This project is an important step forward in my development as a data scientist. Every stage of the process, from developing hypotheses to evaluating results, has deepened my understanding of how data can be used to solve real-world problems. I am motivated to keep refining my approach, adding further layers of complexity, and eventually contributing to meaningful solutions for analyzing and managing crime patterns.

7. References¶

  1. Mohammad Nayeem Teli, "MSML602/DATA602/BIOL602 Principles of Data Science - Final Tutorial Instructions," University of Maryland.
  2. The Effect of Storms in the United States: An example tutorial illustrating data analysis and visualization. Link.
  3. An Evaluation of American Presidential Elections: Demonstrates hypothesis testing and modeling in data science. Link.
  4. Analysis of S&P 500 Companies: Showcasing exploratory analysis and financial modeling. Link.
  5. City Bike Planning: Analyzing bike usage trends to inform city planning. Link.
  6. Scikit-learn Documentation: Random Forest Classifier. Link.
  7. Keras Sequential API Documentation. Link.
  8. Python Data Science Handbook by Jake VanderPlas. Link.
  9. Random Forests by Leo Breiman. Link.
  10. ChatGPT: An AI language model developed by OpenAI. Link.
  11. Crime Data from 2020 to Present | Los Angeles - Open Data Portal. Link.
  12. LAPD Releases End of Year Crime Statistics for the City of Los Angeles 2023. Link.
  13. Crime Mapping and COMPSTAT - LAPD Online. Link.
  14. Crime and Arrest Statistics - Los Angeles County Sheriff's Department. Link.
  15. Los Angeles Crime Rates and Statistics - NeighborhoodScout. Link.
  16. LAPD 2023 Stats Show Homicides and Violent Crime Down, Property Crime and Thefts Up. Link.
  17. Violent Crime in Los Angeles Decreased in 2023. But Officials Worry the... Link.